Scheduling and synchronization for multicore concurrency platforms by Agrawal, Kunal
Scheduling and Synchronization for Multicore Concurrency
Platforms
by
Kunal Agrawal
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
PhD in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2009
© Kunal Agrawal, MMIX. All rights reserved.
The author hereby grants to MIT permission to reproduce and distribute publicly
paper and electronic copies of this thesis document in whole or in part.
Author
Department of Electrical Engineering and Computer Science
September 1, 2009
Certified by
Charles E. Leiserson
Professor
Thesis Supervisor
Accepted by
Terry P. Orlando
Chairman, Department Committee on Graduate Theses

Scheduling and Synchronization for Multicore Concurrency Platforms
by
Kunal Agrawal
Submitted to the Department of Electrical Engineering and Computer Science
on August 20, 2009, in partial fulfillment of the requirements
for the degree of
PhD in Computer Science and Engineering
Abstract
Developing correct and efficient parallel programs is difficult since programmers often have to
manage low-level details like scheduling and synchronization explicitly. Recently, however, many
hardware vendors have been shifting towards building multicore computers. This trend creates an
enormous pressure to create concurrency platforms - platforms that provide an easier interface
for parallel programming and enable ordinary programmers to write scalable, portable and efficient
parallel programs. This thesis provides some provably-good practical solutions to problems that
arise in the implementation of concurrency platforms, particularly in the domain of scheduling and
synchronization.
The first part of this thesis describes work on scheduling parallel programs written in dynamic
multithreaded languages (such as Cilk and Hood). These languages allow programmers to express
the parallelism of their code in a natural manner, while an automatic scheduler in the concurrency
platform is responsible for scheduling the program on the underlying parallel hardware. This
thesis presents designs to increase the functionality of these concurrency platforms.
The second part of the thesis presents work on transactional memory semantics and design.
Transactional memory (TM) has recently been proposed as an alternative to locks. TM provides a
transactional interface to memory: programmers specify their critical sections as transactions,
and the TM concurrency platform guarantees that each such region executes atomically. One of the
purported advantages of TM over locks is that transactional code is composable. Most current TM
concurrency platforms do not support full composability, however. This thesis addresses two of the
composability problems in existing TM concurrency platforms.
Thesis Supervisor: Charles E. Leiserson
Title: Professor

Acknowledgments
First and foremost, I would like to express my deep gratitude to my wonderful adviser, Charles
E. Leiserson. On being accepted to the MIT PhD program, when I first talked to Charles about
working with him, he waved his hands around and talked to me about exciting ideas for about half
an hour. I didn't understand 99% of what he said and was too awed to ask. However, I liked Charles
and his ideas sounded grand. On the basis of this tenuous reasoning, I decided to work with him.
It was the best decision I ever made. I often still don't understand what he says, though now I ask.
His ideas still sound grand, though.
Charles' enthusiasm for research, teaching, and life in general, is boundless and infectious. I
have enjoyed every moment of working with him and he has taught me a lot starting from how to
come up with research ideas to how to get shapes to align in Powerpoint. As an adviser, he provides
just the right mix of guidance and independence to allow me to come into my own as a researcher.
I would like to thank past and present members and visitors to the Supercomputing Technolo-
gies Group for their support and comments. In particular, I have had conversations about various
topics both related and unrelated to research with Michael Bender, Sid Chatterjee, John Danaher,
Jeremy Fineman, Zardosht Kasheff, Bradley Kuszmaul, Edya Ladan Mozes, Jelani Nelson, Tim
Olsen, Gideon Stupp, Angelina Lee and Jim Sukha. Bradley has always been available with help
with various issues such as which machine to buy, how to solve a particularly tricky implementation
problem, etc. and his comments on presentations are invaluable. Michael Bender visited the group
for a year, and has been a wonderful friend and mentor ever since. Yuxiong and I collaborated
closely on my first project and I loved working with her. Jeremy Fineman, Angelina Lee and Jim
Sukha have been friends and collaborators for all of my graduate career. Jim and I have come to
understand each other's incomprehensible ramblings. I will miss walking over to his desk several
times a day and talking about all of my half-baked ideas.
I would also like to thank Yves Robert and Anne Benoit from ENS-Lyon. We started an incred-
ibly fruitful collaboration after talking for a couple of hours and my meetings with them are always
amazingly productive. I had a wonderful time while visiting them at Lyon, and hope to go back, if
only for the fondue.
I would like to express my gratitude towards all the professors I TA'd with: Manolis Kellis, Srini
Devadas, Erik Demaine, Ron Rivest and Charles. I thoroughly enjoyed all of my TA experiences
and learnt a lot about teaching from all of them. I am sure that I will be a better teacher because of
them. In addition, I would like to thank Cynthia Skier for giving me the opportunity to teach at the
MIT EECS Women's Technology Program. The summer I taught at WTP was the best summer I
have ever had. My TA experiences and the WTP experience convinced me that I want teaching to
be a part of my career.
I was fortunate to have great friends to spend time with. I'd like to thank Harr Chen, Pallavi
Kaushik, Yuanzhen Li, Ali Mohammad, Artessa Saldivar-Sali, Lavanya Sharan, Neha Soni, Bill
Thies, Katy Thorn. Special thanks to my roommates over the years who helped me remain sane.
When life was difficult, as it invariably sometimes was, I knew that I could go home and talk to
someone who would listen.
Finally, I am eternally grateful to my wonderful parents and my brother for loving me and
supporting me. From my parents, I have always received encouragement without any pressure to
succeed. My brother is also one of my best friends. Although I haven't always shown it over the
last few years, I love them more than I can say.

Contents
1 Introduction
1.1 Dynamic Multithreading Background
1.2 Background on Transactional Memory
1.3 Outline and Contributions
2 Adaptive Scheduling with Parallelism Feedback
2.1 Background and Motivation
2.2 Scheduling Model and Results
2.3 The Adaptive Greedy Algorithm
2.4 Adaptive Work Stealing
2.5 Trim Analysis of A-GREEDY for Unit Quanta
2.6 Trim Analysis of A-GREEDY for the General Case
2.7 Trim Analysis of A-STEAL
2.8 Related Work
3 Experimental Evaluation of Adaptive Work Stealing with Parallelism Feedback
3.1 Summary of Experiments
3.2 Simulation Setup
3.3 Time Experiments
3.4 Waste Experiments
3.5 Time-Waste Experiments
3.6 Utilization Experiments
3.7 Related Work
4 Library for Dag Evaluations in Cilk++
4.1 Motivation and Results
4.2 DAGEVAL: A dag Evaluation Library
4.3 Analysis of Performance
4.4 Experimental Setup
4.5 Dynamic Programming Application
4.6 Random Dag Microbenchmark
4.7 Future Work
5 Region Helper Locks
5.1 Motivating Example
5.2 Design for Helper Locks
5.3 Completion Time and Space Usage
5.4 Prototype Implementation
5.5 Hash Table Benchmark
5.6 Related Work
6 Memory Models for Transactions
6.1 Background and Motivation
6.2 Transactional Computation Tree Framework
6.3 Transactional Sequential Consistency
6.4 Transactional Memory Models
6.5 Distinctness of the Models
6.6 Related Work
7 Semantics of Open-Nested Transactions
7.1 Subtleties of Open Nesting
7.2 The Operational Model
7.3 Prefix Race-Freedom of the Operational Model
7.4 Discussion
8 Safe Open-Nested Transactions Through Ownership
8.1 Contributions
8.2 Ownership-Aware Transactions
8.3 Ownership Types for Xmodules
8.4 Computations with Xmodules
8.5 The OAT Model
8.6 Serializability by Modules
8.7 Deadlock Freedom
8.8 Related Work
9 Nested Parallelism within Transactions
9.1 Motivation and Results
9.2 CWSTM Framework
9.3 CWSTM Semantics
9.4 A Naive TM
9.5 CWSTM Overview
9.6 CWSTM Conflict Detection
9.7 Trace Maintenance
9.8 Highest Active Transaction
9.9 Supertraces
9.10 Ancestor Queries
9.11 Performance Claims
9.12 Discussion
9.13 Related Work
10 Conclusions and Future Work
Appendices
.1 ON Model and Sequential Consistency
.2 The OAT Model and Sequential Consistency
.3 Rules for Type Checking the OAT Type System

List of Figures
1.1 Proliferation of multicores
1.2 Concurrency Platforms
1.3 Example for nested transactions
1.4 Thesis Organization
2.1 Pseudocode for A-GREEDY
3.1 The parallelism profile (for 2 iterations) of the jobs used in the simulation
3.2 Mean availability and trimmed mean availability
3.3 Comparison of theoretical and practical waste
3.4 How waste varies with parallelism
3.5 Time and waste of A-STEAL vs ABP for large machines
3.6 Time and waste of A-STEAL vs ABP for medium sized machines
3.7 Comparing the utilization over time of A-STEAL+DEQ and ABP+EQ
4.1 Pseudocode for sequential dag evaluation
4.2 dag Evaluation in a fork-join parallel language, using locks
4.3 Solving a DP problem using the dag evaluation library
4.4 Pseudocode for dag evaluation using an eager traversal
4.5 An execution dag for COMPUTEANDNOTIFY*(D)
4.6 Execution dag for EXPAND*(B)
4.7 Pseudocode for a parallel divide-and-conquer solution to DP problem
4.8 Speedup comparison for N = 1000 and B = 16
4.9 Speedup comparison for N = 5000 and B = 16
4.10 Speedup comparison for N = 15000 and B = 16
4.11 Effect of block size on running time
4.12 Overhead comparison
4.13 Comparing the various types of traversals
4.14 Evaluating the effect of parallelism in nodes
5.1 Hash table example
5.2 Code for insert and resize for a concurrent hash table
5.3 Hash table with helper locks
5.4 Deque pool example
5.5 Deque chain example
5.6 Experiments on concurrent hash table with helper locks
6.1 Sample computation tree and computation dag
6.2 Example of sequential consistency
6.3 Dependence graphs depicting distinctness of models
6.4 Example topological sorts
6.5 An example to distinguish the various memory models
6.6 Example race free dag
6.7 Example prefix-race free dag
7.1 Example to demonstrate open nesting
7.2 Inconsistency due to open nesting
7.3 Flawed implementation of a hashtable
8.1 Example module tree
8.2 Example code to specify modules
9.1 Example fork-join program
9.2 The series-parallel dag for Figure 9.1
9.3 Adding transactions to a series-parallel program
9.4 A legend for computation-tree figures
9.5 Computation tree for the program in Figure 9.1
9.6 Pseudocode for conflict detection
9.7 Pseudocode for instrumenting memory accesses
9.8 Cleanup code for aborted transactions
9.9 Example demonstrating trace split after steals
9.10 Pseudocode for the XConflict algorithm
9.11 The definition of arrows used to represent paths in Figures 9.12, 9.13 and 9.14
9.12 The three scenarios for finding transactional ancestor
9.13 Possible scenarios when line 11 returns true
9.14 The scenarios when line 15 returns true
Chapter 1
Introduction
Recently, due to the diminishing returns in uniprocessor performance, the hardware industry has
shifted towards producing multicore chips, where each chip contains multiple processing elements
or cores. The number of cores per chip has been increasing steadily (Figure 1.1), and most desktops
and laptops now have more than one core. Since multicores are parallel machines, programmers
must write parallel programs in order to utilize the full power of these machines.
Parallel programming is significantly more difficult than sequential programming, however. In
order to get good performance from parallel programs, programmers have to carefully engineer
their programs, often based on the particular characteristics of the target parallel machine.
Programmers spend large amounts of time correcting and fine-tuning their programs. In this respect,
writing parallel code is often like writing assembly-level code. As a result, parallel programming
has been the exclusive domain of a small number of expert programmers. With the advent of
multicores, however, it is more desirable than ever to simplify parallel programming so
that ordinary programmers can write programs for multicore machines.
In order to write parallel programs using traditional methods such as pthreads, programmers
often have to design their own scheduler and perform synchronization using fine-grained locks.
Properly designed concurrency platforms can alleviate much of the programmers' burden, however.
A concurrency platform is a software abstraction layer that coordinates, schedules, and manages
resources and provides an interface for programmers to write parallel programs. Figure 1.2
shows the abstract model of a parallel machine. A parallel machine may provide multiple
concurrency platforms, each designed for a particular application domain. With the increasing
adoption of multicore hardware, both the research community and the industry have realized that
concurrency platforms are required to ensure that all programmers can write software that utilizes
the full capability of these multicore computers. Various parallel programming languages and
libraries such as MIT Cilk [BFJ+95], Cilk Arts' Cilk++ [Art09], IBM's X10 [ESS05], Sun's Fortress
[ACH+07], OpenMP [Boa08], and Intel's Thread Building Blocks [Rei07] are examples of concurrency
platforms.
Ideally, a concurrency platform should free programmers from the drudgery of handling low-level
implementation details of parallel programming, allowing them to concentrate on algorithm
design. In addition, a concurrency platform should provide good performance. This thesis provides
some provably good practical solutions to problems that arise in the implementation of concurrency
platforms, primarily in the domain of scheduling and synchronization. In particular, this
thesis presents work on improving concurrency platforms that provide the interface of "dynamic
multithreading" and "transactional memory."
[Figure 1.1: plot of the number of cores per chip (1 to 512) against year (1970 to 2010).]
Figure 1.1: The squares are the uniprocessors and the triangles are multicores. Recently, the industry has
been moving towards producing multicores. In addition, the number of cores in the multicores is increasing
rapidly.
[Figure 1.2: diagram of parallel programs running on top of parallel APIs.]
Figure 1.2: The logical view of a parallel machine with concurrency platforms. A machine may have more
than one concurrency platform for different types of parallel applications. A concurrency platform provides
a parallel API for parallel applications. In addition, it may provide services, such as automatic schedulers,
in order to make parallel programming easier.
This chapter is organized as follows: Section 1.1 provides background and some definitions on
dynamic multithreading and Section 1.2 provides background on transactional memory. Section 1.3
briefly describes the contributions and organization of this thesis.
1.1 Dynamic Multithreading Background
Conventional concurrency platforms such as POSIX threads [Ins] or Java threads [GJSB00] provide
a way to structure a large-scale computation into interacting persistent threads. Obtaining
scalable performance using persistent threads can be difficult, however, because these programs
are not adaptively parallel. If a programmer creates 10 threads, the program cannot effectively
use 11 or more processors, even if those resources become available. Moreover, if the processor
resources diminish to 9 or fewer, the multiplexing of threads onto the available resources can be
dramatically inefficient [BP98a]. Thus, to obtain scalability, many pthreaded programs use the
pthreads to implement a scheduler, such as a task-bag scheduler, complicating the direct expression
of the programmer's desired algorithm.
On the other hand, new programming abstractions like dynamic multithreading - exemplified
in languages like Cilk [BJK+95, FLR98], JCilk [DLL05], Hood [ABP98], NESL [BG96], and Fortress
[ACH+07] - provide concurrency platforms that allow the programmer to express the parallelism
and the structure of the program in a more natural manner. The programmer is encouraged
to express as much parallelism as she can, and the concurrency platform (including the compiler
and the runtime system) is responsible for scheduling the application on the target machine.
Dynamic multithreaded jobs (also called task-parallel jobs) are often modeled as dynamically
unfolding directed acyclic graphs (dags) [BL98, BL99, Blu95, BG96, BGM99, FTYZ90, HS91,
NB99, ST94]. Each node in the dag represents a unit-time instruction, and an edge represents a
serial dependence between nodes. The assumption that each node is a unit-time instruction is
primarily a simplifying one: a longer task can be modeled as a chain of short tasks. A
node (task) is ready to be executed when all its predecessors have been executed. Scheduling
a dynamic multithreaded job involves deciding which of the (potentially many) ready nodes to
execute on the available processors.
We can define two parameters for such jobs. The work $T_1$ of the job is the total number of
nodes in the dag. The work is the amount of time it takes to execute the job on 1 processor,
since the tasks then execute sequentially. The second parameter is the span or critical-path
length $T_\infty$, which is the length of the longest chain of dependencies in the dag. The
span of a job is equal to the completion time of the job on an infinite number of processors.
The completion time of a job on $P$ processors is at least $\max\{T_1/P, T_\infty\}$. In this
thesis, we consider two types of schedulers for dynamic multithreaded jobs. The first scheduler is
the greedy scheduler [Gra69, Bre74], which completes a job in at most $T_1/P + T_\infty$ time.
Since $T_1/P + T_\infty \le 2\max\{T_1/P, T_\infty\}$, a greedy scheduler's completion time is
within a factor of 2 of the optimal completion time. The second scheduler we consider is a
randomized work-stealing scheduler [BL98], which completes a job in expected
$O(T_1/P + T_\infty)$ time (within a constant factor of optimal). In spite of a worse completion
time bound, work-stealing schedulers are often more desirable in practice than greedy schedulers
since they have a better space bound and lower overheads. Many concurrency platforms, such as Cilk,
Cilk++, TBB, Fortress, and X10, employ randomized work-stealing schedulers.
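The work, span, and greedy bound above can be checked on a small example. The following is a minimal sketch, not code from any of the platforms named here: the dag representation (a dictionary from each node to its successors), the function names, and the example dag are all assumptions made for illustration. It computes the work and span of a dag of unit-time nodes and simulates a greedy scheduler that runs as many ready nodes as it has processors on every step.

```python
from collections import deque


def _indegrees(dag):
    """Count the predecessors of every node. `dag` maps node -> list of successors."""
    indeg = {u: 0 for u in dag}
    for u in dag:
        for v in dag[u]:
            indeg[v] += 1
    return indeg


def work_and_span(dag):
    """Return (work, span): the node count and the longest chain, counted in nodes."""
    indeg = _indegrees(dag)
    depth = {u: 1 for u in dag}  # longest chain of nodes ending at u
    ready = deque(u for u in dag if indeg[u] == 0)
    while ready:
        u = ready.popleft()
        for v in dag[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return len(dag), max(depth.values())


def greedy_schedule(dag, processors):
    """Simulate a greedy scheduler and return the number of time steps used.

    On every step the scheduler executes min(processors, number of ready nodes)
    ready nodes; nodes whose predecessors have all run become ready afterwards.
    """
    indeg = _indegrees(dag)
    ready = [u for u in dag if indeg[u] == 0]
    steps = 0
    while ready:
        running, ready = ready[:processors], ready[processors:]
        steps += 1
        for u in running:
            for v in dag[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
    return steps


# Hypothetical example: one source forks four independent tasks that join at a sink.
EXAMPLE_DAG = {
    "s": ["a", "b", "c", "d"],
    "a": ["t"], "b": ["t"], "c": ["t"], "d": ["t"],
    "t": [],
}
```

For this example the work is 6 and the span is 3, so on 2 processors any schedule needs at least max{6/2, 3} = 3 steps and the greedy bound is 6/2 + 3 = 6 steps; the simulated greedy schedule takes 4.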
1.2 Background on Transactional Memory
If two parallel threads access the same shared object concurrently, then these accesses must be
properly synchronized in order to ensure correct performance. If these accesses are not properly
synchronized, then the program is said to have a data race. Data races lead to nondeterministic
and unexpected program behavior. Conventionally, data races are prevented in parallel programs
via mutual-exclusion locks. Locks, however, introduce a host of difficulties. For example, to
avoid deadlock when locking multiple objects, the locks must be acquired in a consistent linear
order. This construct makes programming error-prone. In addition, every thread must grab a lock
before accessing a shared object, regardless of whether another thread is actually accessing the
object. Therefore, in the case where concurrent access to shared objects is rare, locks may intro-
duce unnecessary overhead. Locks represent an example of "low-level" programming, since the
programmer must manage the locks for all the shared objects in the system.
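The lock-ordering discipline mentioned above can be illustrated with a small sketch. The `Account` class, its `rank` field, and the `transfer` function are hypothetical names introduced only for this example; the point is that every thread acquires the two locks in the same global order, so two opposite transfers cannot each hold one lock while waiting for the other.

```python
import threading


class Account:
    """A shared object protected by its own lock (hypothetical example type)."""

    _next_rank = 0

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        # The rank fixes this account's position in the global lock order.
        self.rank = Account._next_rank
        Account._next_rank += 1


def transfer(src, dst, amount):
    """Move `amount` from src to dst; assumes src and dst are distinct accounts."""
    # Always lock the lower-ranked account first, whatever the transfer direction.
    first, second = sorted((src, dst), key=lambda account: account.rank)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount


def run_demo(rounds=1000):
    """Run opposite transfers concurrently and return the final balances."""
    a, b = Account(100), Account(100)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

Note that both locks are taken on every transfer even when no other thread is touching the accounts, which is the overhead the paragraph above describes.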
Transactional memory (TM) was proposed by Herlihy and Moss [HM] about 15 years ago as an
alternative to locks. Recently, many software [HLM03, MSH+06, MSS05, SATH+06], hardware
[HWC+04, AAK+05, MBM+06], and hybrid [DFL+06, KCJ+06] TM systems have been proposed,
and transactional memory has become an active area of research. The programmer simply declares
the critical region to be atomic, and a concurrency platform with a TM interface makes sure that
the instructions in the region appear either to have all occurred atomically or not to have
occurred at all.
A TM system enforces atomicity by tracking the memory locations that each transaction in the
system accesses. Most TM implementations maintain a transaction readset and writeset, i.e., a
list of memory locations that a transaction has read from or written to, respectively. Typically, the
system reports a conflict between two transactions A and B if both transactions access the same
memory location and at least one of those accesses is a write. If A and B conflict, then TM aborts
one of the transactions, rolls back any changes the aborted transaction made to global memory, and
clears its readset and writeset. If a transaction completes without conflicting, then it is committed
and its changes become visible.
    xbegin          // transaction A begins
      x++;
      y++;
      xbegin        // nested transaction I begins
        i++;
      xend          // transaction I ends
      z++;
    xend            // transaction A ends
Figure 1.3: A code example where transaction I is nested inside A. The xbegin and xend delimiters mark
the beginning and end of a transaction.
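The conflict rule stated above (two transactions conflict if both access the same location and at least one access is a write) can be sketched in a few lines. This is an illustrative model only, not the design of any particular TM system: the `Transaction` class, the in-place updates with an undo log, and the `demo` function are assumptions made for the example.

```python
class Transaction:
    """Tracks the readset and writeset of one transaction over a shared dict."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory
        self.readset = set()
        self.writeset = set()
        self.undo = {}  # old values, kept so an abort can roll back (an assumption)

    def read(self, loc):
        self.readset.add(loc)
        return self.memory[loc]

    def write(self, loc, value):
        if loc not in self.undo:
            self.undo[loc] = self.memory[loc]  # remember the value before our first write
        self.writeset.add(loc)
        self.memory[loc] = value

    def abort(self):
        """Roll back this transaction's changes and clear its readset and writeset."""
        for loc, old in self.undo.items():
            self.memory[loc] = old
        self.readset.clear()
        self.writeset.clear()
        self.undo.clear()


def conflict(a, b):
    """True if a and b access a common location and at least one access is a write."""
    return bool(a.writeset & (b.readset | b.writeset)) or bool(b.writeset & a.readset)


def demo():
    memory = {"x": 0, "y": 0}
    a = Transaction("A", memory)
    b = Transaction("B", memory)
    c = Transaction("C", memory)
    a.read("x")
    c.read("x")
    readers_conflict = conflict(a, c)  # two readers of x do not conflict
    b.write("x", 5)
    writer_conflict = conflict(a, b)   # A read x and B wrote x
    b.abort()                          # resolve the conflict by aborting B
    return readers_conflict, writer_conflict, memory["x"], len(b.writeset)
```

Which of the two conflicting transactions is aborted is a policy decision; the sketch aborts B arbitrarily.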
Most of the work in this thesis concerns itself with nesting of transactions. Nested transactions
arise when an outer transaction A in its body calls another transaction I. Figure 1.3 shows code for a
transaction A within which another transaction I is nested. The database community has produced
an extensive literature on nested transactions. Moss [Mos85] credits Davies [Dav73] with inventing
nested transactions, and he credits Reed [Ree78] as providing the first implementation of what we
now call closed transactions. Gray [Gra81] describes what we now call open transactions. The
terms "open" and "closed" nesting were coined by Traiger [Tra83] in 1983.
The TM literature discusses three types of nesting: flat, closed, and open. The semantics and
performance implications of each form of nesting can be understood through the example of
Figure 1.3. If I is flat-nested inside A, then conceptually, A executes as if the code for I were
inlined inside A. With flat nesting, I's reads and writes are added directly to the readset and
writeset of A. Thus, in Figure 1.3, if a concurrent transaction B tries to modify variable i while
I is running (before I has committed) and thereby causes I to abort, then A must abort as well
(since i conceptually belongs to the readset of A as well).
If I is closed-nested inside A (see, for example, [Mos85]), then conceptually, the operations of
I become part of A only when I commits. In Figure 1.3, if B tries to modify i and causes I to abort,
then the system needs to abort and roll back only I; it need not abort A, because A has not
accessed location i yet. Thus, closed nesting generally allows for a more efficient implementation
than flat nesting, because a closed-nested transaction I can abort without forcing its parent
transaction A to abort. If I commits, however, I's readset and writeset are merged with A's
readset and writeset. Thus, if B tries to modify i after I has committed but before A commits,
the system may still abort A.
Finally, if I is open-nested inside A (see [WS92, MCC+06, MH05, Mos06]), then conceptually,
the operations of I are not considered as part of A. When I commits, I's changes are made visible
to any other transaction B immediately, in the scheme of [MCC+06],¹ independent of whether A
later commits or aborts. Thus, in Figure 1.3, B's modification of i never aborts A, because I's
access to variable i is never added to A's readset or writeset. Open-nested transactions have
fewer conflicts than closed-nested transactions, and allow for more concurrency. However,
open-nested transactions break the strict serializability guarantees of TM.
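The difference between the three forms of nesting comes down to when, if ever, the inner transaction's accesses are charged to its parent. The following sketch models only that bookkeeping for the program of Figure 1.3; it does not model memory contents, aborts, or the immediate visibility of an open-nested commit. The `Txn` class and the `figure_1_3` function are hypothetical names used for illustration.

```python
class Txn:
    """Readset/writeset bookkeeping for a possibly nested transaction."""

    def __init__(self, name, parent=None, mode=None):
        self.name = name
        self.parent = parent
        self.mode = mode  # "flat", "closed", or "open"; None for a top-level transaction
        self.readset = set()
        self.writeset = set()

    def _owner(self):
        # Flat nesting behaves as if the inner code were inlined in the parent,
        # so accesses are charged to the parent directly.
        if self.parent is not None and self.mode == "flat":
            return self.parent._owner()
        return self

    def increment(self, loc):
        """Model `loc++`, which both reads and writes loc."""
        owner = self._owner()
        owner.readset.add(loc)
        owner.writeset.add(loc)

    def commit(self):
        if self.parent is not None and self.mode == "closed":
            # Closed nesting: merge the inner footprint into the parent on commit.
            self.parent.readset |= self.readset
            self.parent.writeset |= self.writeset
        # Open nesting: the inner footprint is never merged into the parent.
        # Flat nesting: nothing to merge, accesses were already charged to the parent.


def touches(txn, loc):
    """Would a concurrent write to loc conflict with txn?"""
    return loc in txn.readset or loc in txn.writeset


def figure_1_3(mode):
    """Run the program of Figure 1.3 and report whether a concurrent write to i
    would conflict with A (before I commits, after I commits)."""
    a = Txn("A")
    a.increment("x")
    a.increment("y")
    inner = Txn("I", parent=a, mode=mode)
    inner.increment("i")
    before = touches(a, "i")
    inner.commit()
    a.increment("z")
    after = touches(a, "i")
    return before, after
```

Under these assumptions, `figure_1_3("flat")` reports a conflict with A both before and after I commits, `figure_1_3("closed")` reports one only after I commits, and `figure_1_3("open")` reports none, matching the three cases described above.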
1.3 Outline and Contributions
Dynamic multithreading and transactional memory provide useful abstractions for concurrency
platforms. However, current concurrency platforms that provide these abstractions have certain
limitations. This thesis addresses some of these limitations, and this section briefly describes both
the limitations and the proposed solutions.
Adaptive scheduling
Most concurrency platforms for dynamic multithreaded languages do not address scheduling on
multiprogrammed parallel machines - parallel machines with many parallel jobs running on them
- very well. These concurrency platforms often use nonadaptive schedulers, where the scheduler
allots a fixed number of processors to the job for the job's lifetime. Nonadaptive schedulers are
often ineffective for three reasons. First, when a job starts, the programmer must decide how
many processors to allot to the job; this strategy burdens the programmers with analyzing the
job's parallelism. Second, if the job's parallelism changes during execution, then nonadaptive
schedulers' inflexibility can make them inefficient: jobs with small parallelism may waste processor
cycles if they are allotted more processors than they can use. Third, new jobs may not be able to
enter the system if most of the processors have already been allocated.
This thesis provides theoretical and empirical work on adaptive algorithms that continually
adjust a job's allotment according to its parallelism and the parallelism of other jobs in the system.
This work provides a basis for building concurrency platforms where programmers need not specify
how many processors must be used to run their program, thereby unburdening programmers from
analyzing the parallelism of the program.
For these multiprogrammed environments, my collaborators and I considered a two-level
scheduling model: a system job scheduler is responsible for deciding how many processors each job
is allotted, and each job has its own thread scheduler responsible for scheduling the individual
threads (or tasks) of the job on the allotted processors effectively. Our scheduling algorithms use
parallelism feedback - an estimate of the job's future parallelism - to request the "right" number
of processors for the job. Periodically, the thread scheduler provides parallelism feedback to the
job scheduler by requesting processors for the next interval. We designed thread schedulers that
can provide provably good parallelism feedback for dynamic multithreaded programs and
task-parallel programs. In this thesis, we present two thread schedulers: A-GREEDY, based on greedy
scheduling, and A-STEAL, based on work stealing.
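The interaction between the two levels can be sketched as a simple request-and-allot loop. Everything in this sketch is an assumption made for illustration: the first-come allocation policy of `job_scheduler`, the utilization threshold, and the doubling and halving rule in `next_request` are placeholders, not the A-GREEDY or A-STEAL rules, which are defined in Chapter 2.

```python
def job_scheduler(requests, total_processors):
    """Hypothetical job scheduler: grant requests in order until processors run out."""
    allotments = []
    free = total_processors
    for request in requests:
        allotment = min(request, free)
        allotments.append(allotment)
        free -= allotment
    return allotments


def next_request(request, allotment, used_cycles, quantum_len, threshold=0.8):
    """Placeholder parallelism feedback, computed by a thread scheduler after a quantum.

    request      processors requested for the quantum that just ended
    allotment    processors actually granted for that quantum
    used_cycles  processor cycles spent on useful work during the quantum
    quantum_len  length of the quantum in time steps
    """
    if allotment == 0:
        return request  # nothing was granted, so there is nothing to learn from
    capacity = allotment * quantum_len
    if used_cycles < threshold * capacity:
        return max(1, request // 2)  # many cycles were wasted: ask for fewer
    if allotment == request:
        return request * 2           # used well and fully granted: ask for more
    return request                   # used well but not fully granted: keep asking
```

For example, with 10 processors and requests of 4, 8, and 2, this job scheduler allots 4, 6, and 0 processors. A job that keeps its 4 processors busy for most of the quantum would next request 8.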
In order to ensure the robustness of these thread schedulers, we make three harsh assumptions
in our theoretical analysis. First, we assume that the thread scheduler has no knowledge of the
job's future parallelism. Second, we assume that the job's parallelism can change dramatically and
frequently during execution and the job's future parallelism is not related to its past parallelism.
Third, and most importantly, we assume that the scheduler operates under an omniscient adversarial
environment. This environment knows all about the future and always makes decisions to hurt the
thread scheduler's effectiveness. To analyze thread schedulers' performance under these adversarial
conditions, we developed a new analytical technique called trim analysis, which allows us to prove
that our thread scheduler performs poorly on no more than a small number of time steps, exhibiting
near-optimal behavior on the vast majority.
¹ Several alternative policies for manipulating readsets and writesets were suggested in both [MH05, Mos06].
However, since then, the scheme described above has been used in most implementations, and therefore we do
not discuss the alternatives here.
More precisely, suppose that a job has work $T_1$ and span $T_\infty$. On a machine with $P$ processors,
both A-GREEDY and A-STEAL complete the job in expected $O(T_1/\widetilde{P} + T_\infty + L \lg P)$ time steps,
where $L$ is the length of a scheduling quantum and $\widetilde{P}$ denotes the $O(T_\infty + L \lg P)$-trimmed
availability. This quantity is the average of the processor availability over all but the $O(T_\infty + L \lg P)$ time
steps having the highest processor availability. When the job's parallelism dominates the trimmed
availability, that is, $\widetilde{P} \ll T_1/T_\infty$, the job achieves nearly perfect linear speedup. Conversely, when
the trimmed mean dominates the parallelism, the asymptotic running time of the job is nearly the
length of its span, which is optimal.
We measured the performance of A-STEAL on a simulated multiprocessor system using syn-
thetic workloads. For jobs with sufficient parallelism, our experiments confirm that A-STEAL
provides almost perfect linear speedup across a variety of processor availability profiles. We com-
pared A-STEAL with the ABP algorithm, an adaptive work-stealing thread scheduler developed by
Arora, Blumofe, and Plaxton [ABP98] which does not employ parallelism feedback. On moderately
to heavily loaded machines with large numbers of processors, A-STEAL typically completed
jobs more than twice as quickly as ABP, despite being allotted the same or fewer processors on
every step, while wasting only 10% of the processor cycles wasted by ABP.
This work was done in collaboration with Yuxiong He, Wenjing Hsu and Charles E. Leiser-
son, and appears in three conference papers [AHHL06, AHL06, AHL07] and one journal paper
[AHHL08].
Dag evaluation library
Most languages (such as Cilk [BFJ+95] and Cilk++ [Art09]) and libraries (such as Threading Building
Blocks [Rei07]) that use work-stealing schedulers only allow programs which can be represented
by fork-join (or series-parallel) dags. Some parallel programs may be most easily represented by
non fork-join dags, however. These dags with arbitrary dependencies are tricky to express in fork-
join languages such as Cilk++, since the programmer must maintain additional state to enforce
dependencies that are not captured by the fork-join control flow of the program. My collaborators
and I designed a Cilk++ library, called DAGEVAL, that allows programmers to evaluate arbitrary
dags. In addition, the computation within each node of the dag can vary in load and can contain
parallelism. Furthermore, DAGEVAL requires no modification to the Cilk++ runtime, making the
techniques used applicable to other similar fork-join languages and libraries.
In DAGEVAL, we implement eager traversal of the dag. We prove that the eager traversal
strategy is asymptotically optimal for dags with constant indegree and outdegree. That is,
DAGEVAL completes a dag evaluation in $O(T_1/P + T_\infty)$ time, where $T_1$ is the work and $T_\infty$ is the
critical-path length of the computation. In addition, to evaluate the empirical performance of eager
traversal, we implemented the dynamic program representing the Smith-Waterman algorithm [SW81],
an irregular dynamic program on a 2D grid which is used in computational biology. We find that
when dag nodes are mapped to reasonably-sized blocks, our library's eager traversal exhibits low
overhead and scales better than two other traversal strategies. In some cases, eager traversal even
manages to outperform a divide-and-conquer implementation of the same dynamic program.
This work was done in collaboration with Charles Leiserson and Jim Sukha.
Helper locks in dynamic-multithreaded languages
As mentioned in Section 1.1, work-stealing schedulers guarantee linear speedup and many con-
currency platforms employ randomized work-stealing schedulers. However, these guarantees on
completion time of work-stealing schedulers do not hold if the programs contain synchronization
such as locks. Therefore, these concurrency platforms can be inefficient when executing programs
that contain synchronization. We introduce the notion of region helper locks for such concurrency
platforms. A region helper lock protects a parallel subcomputation, called a parallel region. Pro-
grammers can use region helper locks to express parallelism inside locked critical sections. Region
helper locks allow programs with large critical sections to execute more efficiently, since they al-
low many processors to help complete these parallel regions. More specifically, say processor p
tries to acquire a helper lock f, and fails because some parallel region A protected by f is already
executing. Then, instead of blocking, p helps complete A.
We present a design of a work-stealing based concurrency platform, the parallel region lock
(PRL) runtime, which can execute computations augmented with helper locks and parallel regions.
The parallel region lock runtime allows unbounded nesting of parallel regions. We prove both
completion time and space usage bounds for our PRL design. In particular, assuming that a program
is "deadlock free," we show that the expected running time of a program on $P$ processors is
$O(W/P + T_\infty + PN)$ where $N$ is the number of parallel regions, $W$ is the work of the computation
and $T_\infty$ is the "aggregate span", which is bounded by the sum of the spans of all the regions. For the
space bounds, we prove that PRL completes a program using only $O(PS_1)$ stack space, where $S_1$
is the sum of serial stack space utilization over all parallel regions. Finally, we describe a prototype
of helper locks implemented in MIT Cilk. To demonstrate the feasibility of implementing PRL,
we use the prototype to implement a concurrent hash table with a resize operation protected by a
region helper lock.
Transactional computation framework
Even though there has been a lot of research in TM recently, semantics of certain TM mechanisms
are still poorly understood. Most TM designs are described using their implementation, and it is
often difficult to unravel the semantic implications of design decisions from these descriptions. This
thesis presents a transactional computation framework inspired by Frigo and Luchangco's [FL98]
computation-centric framework. This framework allows us to define transactional semantics in an
implementation independent manner. Our primary motivation for designing this framework was
to precisely define the semantics of "open-nested transactions." We have found, however, that this
framework is flexible enough to allow us both to perform a posteriori analysis of computations in
order to understand the semantics of a TM design, and to define operational behavior of new TM
designs.
Using this model, we define the traditional model of serializability and two new transactional-
memory models, race freedom and prefix-race freedom. We prove that these three memory mod-
els are equivalent for transactional-memory systems that support only closed nesting, as long as
aborted transactions are "ignored." We prove that for systems that support open nesting, however,
the models of serializability, race freedom, and prefix-race freedom are distinct.
This work was done in collaboration with Charles E. Leiserson and Jim Sukha and appears in
[ALS06].
Open nesting and ownership-aware nesting
Open nested transactions allow more concurrency in the TM system, but they also make trans-
actions nonserializable. We use our transactional computation framework in order to show they
support a much weaker memory model, called prefix-race freedom. As a consequence of this
nonserializable behavior, if a transaction A has an open transaction B nested inside it, A may no
longer see a transactionally consistent view of memory. This behavior can be avoided by carefully
structuring A and B. This behavior also means, however, that if a function f containing an open
transaction is called from within a transaction T, then T (and all other transactions that call T, and
so on) must be aware of the fact that f has an open transaction so that T can be properly struc-
tured. Therefore, methods are no longer composable in a concurrency platform that supports open
nesting.
The idea behind open nesting is to ignore "low-level" memory operations of an open-nested
transaction when detecting conflicts for its parent transaction, and instead perform abstract concur-
rency control for the "high-level" operation that the nested transaction represents. Unfortunately,
in a concurrency platform that supports open nesting, the TM runtime is unaware of the different
levels of memory. Due to this, unconstrained use of open nesting leads to anomalous program
behavior.
My collaborators and I designed an alternative called ownership-aware transactional memory
which allows a more systematic and controlled form of open nesting. Ownership-aware transac-
tional memory incorporates the notion of modules into the TM system and requires that trans-
actions and data be associated with specific transactional modules or Xmodules. We propose a
new ownership-aware commit mechanism, a hybrid between an open-nested and closed-nested
commit which commits a piece of data differently depending on which Xmodule owns the data.
Moreover, we provide a set of precise constraints on interactions and sharing of data among the
Xmodules based on familiar notions of abstraction. The ownership-aware commit mechanism and
these restrictions on Xmodules allow us to prove that ownership-aware TM has clean memory-level
semantics. In particular, it guarantees serializability by modules, an adaptation of the definition of
multilevel serializability from database systems. In addition, we describe how a programmer can
specify Xmodules and ownership in a Java-like language. Our type system can enforce most of the
constraints required by ownership-aware TM statically, and can enforce the remaining constraints
dynamically. Finally, we prove that if transactions in the process of aborting obey restrictions on
their memory footprint, then ownership-aware TM is free from semantic deadlock. Therefore,
ownership-aware transactions offer the same concurrency as open nesting, but provide it in a safe
manner.
The research on semantics of open nested transactions was done jointly with Charles Leiserson
and Jim Sukha [ALS06]. The research on ownership aware transactions was done jointly with
Angelina I-Ting Lee and Jim Sukha [ALS09].
[Figure 1.4: Thesis Organization. Figure labels: Scheduling, Synchronization.]
Nested parallelism in TM
Most implementations of transactional memory do not allow a transaction to call another method
if the callee creates new parallelism. Therefore a function containing parallelism (and transactions
to synchronize between its parallel threads) cannot be called from within a transaction. Suppose
that a function B calls a function A from within a transaction. With most current TM implemen-
tations, this code stops working correctly if a serial implementation of A is replaced by a correct
parallel (and transactified) implementation of A with the same interface as the serial implementa-
tion. This behavior is clearly undesirable. Moreover, in programs using a dynamic-multithreaded
language like Cilk, adding transactions in a natural manner generates code with parallelism inside
transactions. It is unnatural to write code with no parallelism inside transactions when using such
languages.
My collaborators and I designed a provably efficient software transactional memory system
that allows parallelism inside transactions. This design is meant to add transactions to dynamic
multithreaded languages that generate series-parallel programs. We designed XConflict, a data
structure that facilitates conflict detection for a software transactional memory system which sup-
ports transactions with nested parallelism and unbounded nesting depth. For languages that use a
Cilk-like work-stealing scheduler, XConflict answers concurrent conflict queries in $O(1)$ time and
can be maintained efficiently. In particular, for a program with $T_1$ work and a span (or critical-path
length) of $T_\infty$, the running time on $P$ processors of the program augmented with XConflict is only
$O(T_1/P + PT_\infty)$.
Using XConflict, we describe CWSTM, a concurrency platform design for software transac-
tional memory which supports transactions with nested parallelism and unbounded nesting depth
of transactions. In the restricted case when no transactions abort and there are no concurrent readers,
CWSTM executes a transactional computation on $P$ processors also in time $O(T_1/P + PT_\infty)$.
Although this bound holds only under rather optimistic assumptions, this result is the first theoreti-
cal performance bound on a TM system that supports transactions with nested parallelism which is
independent of the maximum nesting depth of transactions.
This work was done in collaboration with Jeremy Fineman and Jim Sukha [AFS08].
Thesis organization
All of these contributions fall into the domain of scheduling and synchronization. Figure 1.4 shows
the contributions of the thesis in pictorial form. Chapter 2 provides the algorithm and the theoret-
ical results for adaptive scheduling, while Chapter 3 provides the experimental results. Chapter 4
provides work on dag evaluation. Chapter 5 provides work on parallel regions and helper locks.
Chapter 6 explains the transactional computation framework, while Chapter 7 uses this framework
to explain the semantics of open-nested transactions. Chapter 8 presents ownership-aware transac-
tional memory, and Chapter 9 presents CWSTM.

Chapter 2
Adaptive Scheduling with Parallelism
Feedback
In this chapter, we describe adaptive scheduling with parallelism feedback. The parallelism of
programs written using task parallel or dynamic multithreaded languages such as Cilk [BFJ+95],
Cilk++ [Art09], Intel's Threading Building Blocks [Rei07], etc. can change during the execution.
Most concurrency platforms use scheduling algorithms that are nonadaptive, however, where a
fixed number of processors is allotted to the job for its lifetime. In a multiprogrammed environment,
where a large number of jobs are executing on the same machine, nonadaptive schedulers are
inflexible, and become inefficient when the parallelism of the jobs changes, or new jobs enter the
system. This research concentrates on designing adaptive algorithms that use parallelism feedback,
an estimate of the job's future parallelism, to request the "right" number of processors for the job
so as to minimize both the completion time of the job, and the waste of processing resources by the
job. These algorithms provide a basis for designing more effective concurrency platforms for task
parallel and dynamic multithreaded languages. This chapter also describes the theoretical analysis
of these algorithms. (Chapter 3 concentrates on the experimental evaluation.) In order to provide
robust results, our theoretical analysis assumes harsh adversarial conditions and we introduce a new
analytical technique called "trim analysis" in order to handle these conditions.
The chapter is organized as follows: Section 2.1 provides more background and motivation
for adaptive scheduling; Section 2.2 describes our scheduling model and results, and introduces
trim analysis; Sections 2.3 and 2.4 describe adaptive algorithms for greedy scheduling and work
stealing respectively; Sections 2.5, 2.6, and 2.7 describe the theoretical analysis of these algorithms,
and finally, Section 2.8 describes some related work.
2.1 Background and Motivation
The scheduling of a collection of parallel jobs onto a multiprocessor is an old and well-studied
topic of research [TLW+94, BEGCS74, DGBL96, DD96, Gu95, MVZ93, Edm99, LV90, Squ95,
TG89]. We consider so-called space-sharing [Fei97] for parallel jobs, where jobs occupy disjoint
processor resources, as opposed to time-sharing [Fei97], where different jobs may share the same
processor resources at different times. Space-sharing schedulers can be implemented using a two-level
strategy [Fei97]: a kernel-level job scheduler which allots processors to jobs, and a user-level
thread scheduler which schedules the tasks belonging to a given job onto the allotted processors.
This chapter is joint work with Yuxiong He, Wenjing Hsu and Charles E. Leiserson [AHHL06, AHL07].
Most prior work on thread scheduling for multithreaded jobs deals with nonadaptive scheduling
[BL99, BGM95, Gra69, Bre74, BG96, NB99], where the job scheduler allots a fixed number of
processors to the job for its entire lifetime. For jobs whose parallelism is unknown in advance and
which may change during execution, this strategy may waste processor cycles [Squ95], because a
job with low parallelism may be allotted more processors than it can productively use. Moreover, in
a multiprogrammed environment, nonadaptive scheduling may not allow a new job to start, because
existing jobs may already be using most of the processors.
With adaptive scheduling [ABP98] (called "dynamic" scheduling in many papers), the job
scheduler can change the number of processors allotted to a job while the job is executing. Thus,
new jobs can enter the system, because the job scheduler can simply recruit processors from the
already executing jobs and allot them to the new job. Unfortunately, as with a nonadaptive sched-
uler, this strategy may cause waste, because a job with low parallelism may still be allotted more
processors than it can productively use.
Therefore, we require an adaptive scheduling strategy where the thread scheduler provides parallelism
feedback to the job scheduler so that when a job cannot use many processors, those processors
can be reallotted to jobs with ample need. Based on this parallelism feedback, the job scheduler
can change the allotment of processors to each job according to the availability of processors in the
current system environment and the job scheduler's administrative policy.
The question of how the job scheduler should partition the multiprocessor among the vari-
ous jobs has been studied extensively [DGBL96, DD96, Gu95, MPT93, MVZ93, Edm99, LV90,
RSSD95, RSD+94, YL01, MCN+00, ECBD03], but the administrative policy of the job scheduler
is not the focus of this work. Instead, we study the problem of how the thread scheduler provides
effective parallelism feedback to the job scheduler without knowing the future progress of the job,
the future availability of processors, or the administrative priorities of the job scheduler.
Various researchers [DGBL96, DD96, Gu95, MVZ93, YL01] have used the notion of instantaneous
parallelism,¹ the number of processors the job can effectively use at the current moment, as
the parallelism feedback to the job scheduler. Although using instantaneous parallelism for parallelism
feedback is simple, it can cause gross misallocation of processor resources [Sen04]. For example,
the parallelism of a job may change substantially during a scheduling quantum, alternating
between parallel and serial phases. The sampling of instantaneous parallelism at a scheduling event
between quanta may lead the thread scheduler to request either too many or too few processors de-
pending on which phase is currently active, whereas the desirable request might be something in
between. Consequently, the job may systematically waste processor cycles on the one hand or take
too long to complete on the other.
Instead of using instantaneous parallelism, we use history-based strategies to provide paral-
lelism feedback. Our strategy provides parallelism feedback to the job scheduler based on a single
summary statistic and the job's behavior on the previous quantum. Even though our schedulers
provide parallelism feedback using only the past behavior of the job, and we assume that the job's
future parallelism can be completely uncorrelated with its history of parallelism, we prove
that they schedule the job well with respect to both waste and completion time.
¹ These researchers actually use the general term "parallelism," but we prefer the more descriptive term.
2.2 Scheduling Model and Results
This section describes our scheduling model, and explains our results in detail. The scheduling
model describes the mechanics of the communication between the thread scheduler and the job
scheduler. In this section, we also explain our adversarial model of analysis and introduce a new
analytical technique, called trim analysis, in order to handle this adversarial model.
Our scheduling model is as follows: Each job has its own thread scheduler, and the thread
scheduler operates in an online manner, oblivious to both the future characteristics of its job, and to
the other jobs in the system. We assume that time is broken into a sequence of equal-size schedul-
ing quanta 1, 2,..., each consisting of L time steps, and the job scheduler is free to reallocate
processors between quanta. The thread scheduler operates as follows. Between quanta $q - 1$ and
$q$, it determines its job's desire $d_q$, which is the number of processors the job wants for quantum $q$.
The thread scheduler provides the desire $d_q$ to the job scheduler as its parallelism feedback. The
job scheduler follows some processor allocation policy to determine the processor availability $p_q$,
the number of processors to which the job is entitled for the quantum $q$. The number of processors
the job receives for quantum $q$ is the job's allotment $a_q = \min\{d_q, p_q\}$, the smaller of the
job's desire and the processor availability. Once a job is allotted its processors, the allotment does
not change during the quantum. Consequently, the thread scheduler must do a good job before a
quantum of estimating how many processors it will need for all L time steps of the quantum, as
well as do a good job of scheduling the job on the allotted processors.
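As a minimal sketch of the quantum-level protocol just described, the following Python fragment computes the allotment $a_q = \min\{d_q, p_q\}$ for a sequence of quanta. The function name `allotments` and the sample desires and availabilities are illustrative assumptions, not part of the thesis.

```python
def allotments(desires, availabilities):
    """Return the allotment a_q = min(d_q, p_q) for each quantum q."""
    if len(desires) != len(availabilities):
        raise ValueError("need one desire and one availability per quantum")
    return [min(d, p) for d, p in zip(desires, availabilities)]


# Hypothetical run of four quanta: the job is satisfied in the first two
# quanta and deprived in the last two, when availability drops to 3.
example = allotments([1, 2, 4, 8], [16, 16, 3, 3])
```

The allotment is fixed once per quantum, so a thread scheduler that misjudges its desire lives with that choice for all L time steps.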
This chapter describes two adaptive thread schedulers, A-GREEDY and A-STEAL, which pro-
vide parallelism feedback. A-GREEDY is a greedy thread scheduler suitable for centralized schedul-
ing, where each job's thread scheduler can dispatch all the ready threads to the allotted processors in
a centralized manner, such as the scheduling of data-parallel jobs. A-STEAL is a distributed thread
scheduler, where each job is executed by decentralized work-stealing [BS81, Hal84, RSAU91,
BL99]. These thread schedulers complete the job in near-optimal time while guaranteeing low
waste.
Our theoretical analysis models the job scheduler as the thread scheduler's adversary, challenging
the thread scheduler to be robust to the system environment and the job scheduler's administrative
policies. As with completion time results for nonadaptive greedy and work-stealing
schedulers, we provide results in terms of the work $T_1$ and the span $T_\infty$ of the job. As mentioned
in Chapter 1, nonadaptive greedy and randomized work-stealing schedulers guarantee that a job
completes in $O(T_1/P + T_\infty)$ time, which is within a constant factor of optimal. In an adaptive setting
where the number of processors allotted to a job can change during execution, both $T_1/\overline{P}$ and
$T_\infty$ are lower bounds on the running time, where $\overline{P}$ is the mean of the processor availability during
the computation. Therefore, one would like to prove a completion time of $O(T_1/\overline{P} + T_\infty)$ (which
is asymptotically optimal and analogous to the nonadaptive results). In the worst case, however,
an adversarial job scheduler can prevent any thread scheduler from providing this completion
time. For example, if the adversary chooses a huge number of processors for the job's processor
availability just when the job has little instantaneous parallelism - the number of threads ready
to run at a given moment - no adaptive scheduling algorithm can effectively utilize the available
processors on that quantum. The adversary can therefore keep the availability low for all quanta
with high parallelism and high for all quanta with low parallelism. With this availability profile,
irrespective of parallelism feedback, the job will run slowly despite the mean parallelism being
high.²
We introduce a technique called trim analysis to analyze the time bound of adaptive thread
schedulers under these adversarial conditions. From the field of statistics, trim analysis borrows
the idea of ignoring a few "outliers." A trimmed mean, for example, is calculated by discarding
a certain number of lowest and highest values and then computing the mean of those that remain.
For our purposes, it suffices to trim the availability from just the high side. For a given value R, we
define the R-high-trimmed mean availability as the mean availability after ignoring the R steps
with the highest availability, or just R-trimmed availability, for short. A good thread scheduler
should provide linear speedup with respect to an R-trimmed availability, where R is as small as
possible.
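The definition of R-trimmed availability can be illustrated with a short Python sketch. The function name `trimmed_availability` and the sample availability profile are assumptions made for illustration; the thesis defines the quantity only mathematically.

```python
def trimmed_availability(availability, r):
    """Mean per-step availability after ignoring the r steps with the highest values."""
    if r < 0:
        raise ValueError("r must be nonnegative")
    keep = max(len(availability) - r, 0)
    kept = sorted(availability)[:keep]      # drop the r largest entries
    if not kept:
        raise ValueError("no steps remain after trimming")
    return sum(kept) / len(kept)


# Hypothetical profile: the adversary offers 64 processors on two steps
# (say, while the job is serial) and only 2 processors on the other four.
profile = [2, 2, 2, 2, 64, 64]
plain_mean = sum(profile) / len(profile)
trimmed = trimmed_availability(profile, 2)
```

In this profile the plain mean is inflated by the two high-availability steps, whereas the 2-trimmed availability is 2, which is the figure a scheduler could realistically be measured against.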
We prove that both A-GREEDY and A-STEAL guarantee linear speedup with respect to
$O(T_\infty + L \lg P)$-trimmed availability.³ Specifically, consider a job with work $T_1$ and span $T_\infty$ running on
a machine with $P$ processors and a scheduling quantum of length $L$. A-STEAL completes the job
in expected $O(T_1/\widetilde{P} + T_\infty + L \lg P)$ time steps, where $\widetilde{P}$ denotes the $O(T_\infty + L \lg P)$-trimmed
availability. Thus, the job achieves linear speedup with respect to the trimmed availability $\widetilde{P}$ when
the parallelism $T_1/T_\infty$ dominates $\widetilde{P}$. In addition, we prove that the total number of processor cycles
wasted by the job is $O(T_1)$, representing at most a constant-factor overhead.
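A short derivation, added here as an illustration, shows how the time bound yields linear speedup. Write $\widetilde{P}$ for the trimmed availability, $T_1$ for the work, and $T_\infty$ for the span. The derivation assumes that the parallelism dominates the trimmed availability and, as a further assumption not stated in the text, that the additive term $L \lg P$ is itself $O(T_1/\widetilde{P})$:

```latex
\[
\widetilde{P} \le \frac{T_1}{T_\infty}
\;\Longrightarrow\;
T_\infty \le \frac{T_1}{\widetilde{P}}
\;\Longrightarrow\;
\frac{T_1}{\widetilde{P}} + T_\infty + L \lg P
\;\le\; \frac{2\,T_1}{\widetilde{P}} + L \lg P
\;=\; O\!\left(\frac{T_1}{\widetilde{P}}\right).
\]
```

Under these assumptions the running time is within a constant factor of $T_1/\widetilde{P}$, which is what linear speedup with respect to the trimmed availability means.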
2.3 The Adaptive Greedy Algorithm
This section presents the adaptive greedy thread scheduler A-GREEDY. Before each quantum,
A-GREEDY provides parallelism feedback to the job scheduler based on the job's history of utilization
using a simple multiplicative-increase, multiplicative-decrease algorithm. A-GREEDY classifies
quanta as "satisfied" versus "deprived" and "efficient" versus "inefficient." Of the four possibilities
of classification, however, A-GREEDY only uses three: inefficient, efficient-and-satisfied, and
efficient-and-deprived. Using this three-way classification and the job's desire for the previous
quantum, it computes the desire for the next quantum. After the job scheduler allots $a_q$ processors
to the job for quantum $q$, A-GREEDY uses greedy scheduling [Gra69, Bre74] to schedule the ready
tasks on the allotted processors.
To classify a quantum $q$ as satisfied versus deprived, A-GREEDY compares the job's allotment
$a_q$ with its desire $d_q$. The quantum $q$ is satisfied if $a_q = d_q$, that is, the job receives as many
processors as A-GREEDY requested on its behalf from the job scheduler. Otherwise, if $a_q < d_q$, the
quantum is deprived, because the job did not receive as many processors as A-GREEDY requested.
Classifying a quantum as efficient versus inefficient is more complicated. We define the usage
$u_q$ of a quantum $q$ as the amount of work completed by the job during the quantum, which is to
say, the total number of unit-time tasks in the dag that were completed during the quantum. The
maximum possible usage for a quantum $q$ is $L a_q$, where $L$ is the length of quanta and $a_q$ is the
job's allotment for quantum $q$. A-GREEDY uses a utilization parameter $\delta$, where $0 < \delta < 1$, as
a threshold to differentiate between efficient and inefficient quanta. Typical values for $\delta$ might be
90-95%. We call a quantum $q$ efficient if $u_q \ge \delta L a_q$, that is, the usage is at least a $\delta$ fraction of the
maximum possible usage, in which case the job wastes few (at most $(1 - \delta) L a_q$) processor cycles.
We call a quantum $q$ inefficient otherwise.
² Using mean processor allotment instead of mean availability does not provide useful results. The trivial thread
scheduler that always requests (and receives) 1 processor can achieve perfect linear speedup with respect to its mean
allotment (which is 1) while wasting no processor cycles. By using a measure of availability, the thread scheduler must
attempt to exploit parallelism.
³ The constants in the bound are different for the two schedulers.
A-GREEDY(q, δ, ρ)
 1  if q = 1
 2     then d_q ← 1                     ▷ base case
 3  elseif u_{q-1} < δ L a_{q-1}
 4     then d_q ← d_{q-1} / ρ           ▷ inefficient
 5  elseif a_{q-1} = d_{q-1}
 6     then d_q ← ρ d_{q-1}             ▷ efficient-and-satisfied
 7  else d_q ← d_{q-1}                  ▷ efficient-and-deprived
 8  Report desire d_q to the job scheduler.
 9  Receive allotment a_q from the job scheduler.
10  Greedily schedule on a_q processors for L time steps.
Figure 2.1: Pseudocode for the adaptive greedy algorithm. A-GREEDY provides parallelism feedback to
a job scheduler in the form of a desire for processors. Before quantum $q$, A-GREEDY uses the previous
quantum's desire $d_{q-1}$, allotment $a_{q-1}$, and usage $u_{q-1}$ to compute the current quantum's desire $d_q$ based on
the utilization parameter $\delta$ and the responsiveness parameter $\rho$.
A-GREEDY calculates the desire $d_q$ of the current quantum $q$ based on the previous desire $d_{q-1}$
and the three-way classification of quantum $q - 1$ as inefficient, efficient-and-satisfied, or
efficient-and-deprived. The initial desire is $d_1 = 1$. A-GREEDY uses a responsiveness parameter $\rho > 1$ to
determine how quickly the scheduler responds to changes in parallelism. Typical values of $\rho$ might
range between 1.2 and 2.0. Figure 2.1 shows the pseudocode of A-GREEDY for one quantum.
The algorithm takes as input the quantum $q$, the utilization parameter $\delta$, and the responsiveness
parameter $\rho$. Intuitively, it operates as follows:
* If quantum q - 1 was inefficient, A-GREEDY overestimated the desire. In this case, A-
GREEDY does not care whether the quantum is satisfied or deprived, and it decreases the
desire (line 4) in quantum q.
* If quantum q - 1 was efficient-and-satisfied, the job effectively utilized the processors that
A-GREEDY requested on its behalf. Thus, A-GREEDY speculates that the job can use more
processors and increases the desire (line 6) in quantum q.
* If quantum q - 1 was efficient but deprived, the job used all the processors it was allotted, but
A-GREEDY had requested more processors for the job than the job actually received from
the job scheduler. Since A-GREEDY has no evidence whether the job could have used all the
processors requested, it maintains the same desire (line 7) in quantum q.
Remarkably, this simple algorithm provides strong guarantees on waste and performance.
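The desire computation in lines 3-7 of Figure 2.1 can be sketched in Python as follows. This is an illustrative transcription under stated assumptions, not the thesis's implementation: the function name `next_desire`, the argument names, and the sample history are hypothetical, desires are treated as real numbers, and the base case (initial desire 1) is left to the caller.

```python
def next_desire(prev_desire, prev_allotment, prev_usage, L, delta, rho):
    """Desire for quantum q computed from the statistics of quantum q - 1.

    delta is the utilization parameter (0 < delta < 1), rho > 1 is the
    responsiveness parameter, and L is the quantum length in time steps.
    """
    if prev_usage < delta * L * prev_allotment:   # inefficient: back off
        return prev_desire / rho
    if prev_allotment == prev_desire:             # efficient-and-satisfied: ask for more
        return rho * prev_desire
    return prev_desire                            # efficient-and-deprived: hold steady


# Hypothetical history of (desire, allotment, usage) with L = 10:
# quantum 1 and 2 are efficient-and-satisfied, quantum 3 is inefficient,
# and quantum 4 is efficient but deprived (it wanted 2 processors, got 1).
history = [(1, 1, 10), (2, 2, 20), (4, 4, 20), (2, 1, 10)]
trace = [next_desire(d, a, u, L=10, delta=0.8, rho=2.0) for d, a, u in history]
```

Each entry of `trace` is the desire reported for the following quantum, so the sequence of desires in this run is 1, 2, 4, 2, 2. Choosing `rho = 2.0` keeps every desire exactly representable, which makes the equality test between allotment and desire safe in this sketch.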
Greedy Scheduling
After the thread scheduler is allotted $a_q$ processors for quantum $q$, it uses greedy scheduling to
schedule the ready tasks on processors. Greedy scheduling operates as follows: At any time step,
if more than $a_q$ tasks are ready, then schedule any $a_q$ of them. If at most $a_q$ tasks are ready,
then schedule all of them. Greedy scheduling is therefore a centralized strategy, since the thread
scheduler must be aware of all the job's ready tasks.
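The greedy rule can be sketched as a small Python simulation of one quantum on a dag of unit-time tasks. The representation of the dag (a `successors` map and an `indegree` map), the function name `greedy_quantum`, and the diamond-shaped example are assumptions chosen for illustration.

```python
def greedy_quantum(ready, successors, indegree, a, L):
    """Run greedy scheduling for one quantum of L steps on a processors.

    ready      : list of task ids whose predecessors have all completed
    successors : dict mapping a task id to the ids that depend on it
    indegree   : dict mapping a task id to its number of unfinished predecessors
    Returns the usage, i.e. the number of unit-time tasks completed.
    Mutates ready and indegree so that a later quantum can continue.
    """
    usage = 0
    for _ in range(L):
        if not ready:
            break                                  # nothing left to run
        # More than a ready tasks: run any a of them; otherwise run them all.
        running, ready[:] = ready[:a], ready[a:]
        usage += len(running)
        for task in running:                       # tasks finish at the end of the step
            for nxt in successors.get(task, ()):
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    ready.append(nxt)
    return usage


# Hypothetical dag: task 0 enables tasks 1, 2, 3, which together enable task 4.
successors = {0: [1, 2, 3], 1: [4], 2: [4], 3: [4]}
indegree = {0: 0, 1: 1, 2: 1, 3: 1, 4: 3}
ready = [0]
usage = greedy_quantum(ready, successors, indegree, a=2, L=3)
```

With 2 processors and 3 steps the maximum possible usage is 6, but the dag only exposes enough ready tasks for a usage of 4, leaving task 4 ready for the next quantum.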
2.4 Adaptive Work Stealing
This section presents the adaptive work-stealing thread scheduler A-STEAL. As with A-GREEDY,
before the start of a quantum, A-STEAL estimates processor desire based on the job's history of
utilization to provide parallelism feedback to the job scheduler. Instead of greedy scheduling,
however, A-STEAL uses randomized work stealing [BL99, ABP98, MKHJ90] to schedule the
job's ready tasks on the allotted processors. Unlike greedy scheduling, work stealing is a distributed
strategy and therefore has lower synchronization overheads.
A-STEAL can use any provably good work-stealing algorithm, such as that of Blumofe and
Leiserson [BL99] or the nonblocking one presented by Arora, Blumofe, and Plaxton [ABP98].⁴
In these work-stealing thread schedulers, every processor allotted to the job maintains a double-
ended queue, or deque, of ready threads for the job. When the current thread spawns a new thread,
the processor pushes the continuation of the current thread onto the top of the deque and begins
working on the new thread. When the current thread completes or blocks, the processor pops
the topmost thread off the deque to work on. If the deque of a processor is empty, however, the
processor becomes a thief, randomly picking a victim processor and stealing work from the bottom
of the victim's deque. If the victim has no available work, then the steal is unsuccessful, and the
thief continues to steal at random from the other processors until it is successful and finds work.
At all times, every processor is either working or stealing.
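The deque discipline described above can be illustrated with a small sequential sketch. It assumes threads are plain strings and ignores all concurrency control, so it is not the algorithm of [BL99] or [ABP98]; the class `Worker` and the function `next_thread` are hypothetical names introduced only for this example.

```python
import random
from collections import deque
from typing import Deque, List, Optional


class Worker:
    """One processor with its own deque; the right end is the 'top'."""

    def __init__(self) -> None:
        self.deque: Deque[str] = deque()

    def push_top(self, thread: str) -> None:
        self.deque.append(thread)

    def pop_top(self) -> Optional[str]:
        return self.deque.pop() if self.deque else None

    def steal_bottom(self) -> Optional[str]:
        return self.deque.popleft() if self.deque else None


def next_thread(me: int, workers: List[Worker], rng: random.Random) -> Optional[str]:
    """Pop from the local deque; if it is empty, make one steal attempt."""
    local = workers[me].pop_top()
    if local is not None:
        return local
    others = [i for i in range(len(workers)) if i != me]
    if not others:
        return None
    victim = rng.choice(others)             # pick a victim uniformly at random
    return workers[victim].steal_bottom()   # None means the steal was unsuccessful
```

The owner works at the top of its deque while thieves take from the bottom, so the two ends are touched by different parties.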
Making work-stealing adaptive
This work-stealing algorithm must be modified to deal with dynamic changes in processor al-
lotment to the job between quanta. Two simple modifications make the work-stealing algorithm
adaptive.
Allotment gain: When the allotment increases from quantum $q-1$ to $q$, the job scheduler obtains
$a_q - a_{q-1}$ additional processors. Since the deques of these new processors start out empty,
all these processors immediately start stealing to get work from the other processors.
Allotment loss: When the allotment decreases from quantum $q-1$ to $q$, the job scheduler deallocates
$a_{q-1} - a_q$ processors, whose deques may be nonempty. To deal with these deques, we
use the concept of "mugging" [BLS]. When a processor runs out of work, instead of stealing
immediately, it looks for a muggable deque, a nonempty deque that has no associated processor
working on it. Upon finding a muggable deque, the thief mugs the deque by taking over
the entire deque as its own. Thereafter, it works on the deque as if it were its own. If there are
no muggable deques, the thief steals normally. Data structures can be set up between quanta
so that stealing and mugging can be accomplished in $O(1)$ time [Sen04].
4 These algorithms impose some additional restrictions on the job, for example, that each node has an out-degree of
at most 2. Whatever restrictions are assumed by the underlying work-stealing algorithm apply to A-STEAL as well.
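The preference for mugging over stealing can be sketched as below. This is a sequential illustration only: the function `idle_cycle` and its arguments are hypothetical, the caller is assumed to keep only nonempty ownerless deques in `muggable`, and the sketch makes no attempt at the constant-time data structures of [Sen04].

```python
import random
from collections import deque
from typing import Deque, List


def idle_cycle(own: Deque[str], muggable: List[Deque[str]],
               victims: List[Deque[str]], rng: random.Random) -> str:
    """One cycle of a processor whose own deque is empty.

    Returns "mug" if an abandoned deque was taken over, otherwise "steal"
    (whether or not the steal attempt found any work).
    """
    if own:
        raise ValueError("idle_cycle expects an empty deque")
    if muggable:
        abandoned = muggable.pop()
        own.extend(abandoned)     # take over the entire deque, preserving its order
        abandoned.clear()
        return "mug"
    if victims:
        victim = rng.choice(victims)
        if victim:
            own.append(victim.popleft())   # steal from the bottom of the victim's deque
    return "steal"
```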
At all time steps during the execution of A-STEAL, every processor is either working, stealing,
or mugging. We call the cycles that a processor spends on working, stealing, and mugging
work-cycles, steal-cycles, and mug-cycles, respectively. We assume without loss of generality that
work-cycles, steal-cycles, and mug-cycles all take single time steps. We bound time and waste in
terms of these elementary processor cycles. Cycles spent stealing and mugging are wasted, and the
total waste is the sum of the number of steal-cycles and mug-cycles during the execution of the job.
A-STEAL's desire-estimation heuristic
The desire estimation algorithm for A-STEAL is similar to A-GREEDY's desire estimation. Again,
A-STEAL classifies the previous quantum as either "satisfied" or "deprived" and either "efficient"
or "inefficient." The classification of satisfied versus deprived is identical to A-GREEDY. However,
the classification of efficient versus inefficient is slightly different. Instead of usage (as with
A-GREEDY), A-STEAL uses nonsteal usage, which is the sum of the number of work-cycles and
mug-cycles. Although it might seem counterintuitive for the definition of "efficient" to include
mug-cycles, which, after all, are wasted, the rationale is that mug-cycles arise as a result of an
allotment loss and do not generally indicate that the job has a surplus of processors. Apart from
this difference, the desire estimation of A-STEAL is similar to that of A-GREEDY.
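A minimal sketch of the desire update follows, assuming the three-way rule used in the analyses of this chapter: divide the desire by $\rho$ after an inefficient quantum, multiply it by $\rho$ after an efficient-and-satisfied quantum, and keep it otherwise, where a quantum counts as efficient when the nonsteal usage is at least $\delta L a_q$. The function and parameter names are hypothetical.

```python
def next_desire(prev_desire: float, prev_allotment: int, nonsteal_usage: float,
                quantum_length: int, delta: float, rho: float) -> float:
    """Return the desire for quantum q given the statistics of quantum q - 1."""
    # Efficient: work-cycles plus mug-cycles reach a delta fraction of the allotted cycles.
    efficient = nonsteal_usage >= delta * quantum_length * prev_allotment
    # Satisfied: the job received everything it asked for.
    satisfied = prev_allotment == prev_desire
    if not efficient:
        return prev_desire / rho      # inefficient: ask for less
    if satisfied:
        return prev_desire * rho      # efficient and satisfied: ask for more
    return prev_desire                # efficient and deprived: keep the same desire
```

Passing the plain usage instead of the nonsteal usage gives the corresponding rule for A-GREEDY.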
2.5 Trim Analysis of A-GREEDY for Unit Quanta
This section uses a trim analysis to analyze A-GREEDY for the special case where L = 1, that is,
where each quantum is a unit quantum consisting of only a single time step. For unit quanta and for
greedy schedulers, adaptive scheduling can be done efficiently using instantaneous parallelism as
feedback, since the scheduler knows exactly how many tasks are ready for the next step and can request
exactly the right number of processors. However, this approach of using instantaneous parallelism
does not generalize to longer quanta, or to work-stealing schedulers. Surprisingly, A-GREEDY's
algorithm for desire estimation, which only uses historical information, provides nearly as good
time bounds as the approach that uses instantaneous parallelism even for unit quanta. Moreover,
these bounds can be extended to the case when L > 1 (Section 2.6) and to work-stealing schedulers
(Section 2.7). The analysis for unit quanta given in this section gives intuition for the effectiveness
of A-GREEDY's strategy for desire estimation.
For unit quanta, we shall prove that A-GREEDY with utilization parameter $\delta = 1$ completes a
job with work $T_1$ and critical-path length $T_\infty$ in at most
$T \le T_1/\widetilde{P} + 2T_\infty + \log_\rho P + 1$ time steps,
where $P$ is the number of processors in the machine and $\widetilde{P}$ is the
$(2T_\infty + \log_\rho P + 1)$-trimmed availability. In contrast, a greedy thread scheduler that uses
instantaneous parallelism as feedback completes the job in at most $T \le T_1/\widetilde{P} + T_\infty$
time steps, where $\widetilde{P}$ is the $T_\infty$-trimmed availability.
Thus, even without up-to-date information on instantaneous parallelism, A-GREEDY operates
nearly as efficiently. Moreover, the total number of processor cycles wasted by A-GREEDY in the
course of the computation is bounded by $\rho T_1$. (A scheduler that uses instantaneous parallelism wastes none.)
To prove the completion-time bounds, we use a trim analysis. We label each quantum as either
accounted or deductible. Accounted quanta are those where $u_q = p_q$, that is, the usage equals the
processor availability. The deductible quanta are those where $u_q < p_q$. Our trim analysis will show
that when we ignore the relatively few deductible quanta, we obtain linear speedup on the more
numerous accounted quanta.
We first relate the labeling of accounted and deductible to the three-way classification of quanta
as inefficient, efficient-and-satisfied, and efficient-and-deprived.
Inefficient: In an inefficient quantum $q$, we have $u_q < a_q \le p_q$, that is, the job uses fewer
processors than it was allotted, and therefore it uses fewer processors than those available. Thus,
inefficient quanta are deductible quanta, irrespective of whether they were satisfied or deprived.
Efficient-and-satisfied: On an efficient quantum $q$, we have $u_q = a_q$. Since $a_q = \min\{p_q, d_q\}$
by definition, on a satisfied quantum, we have $a_q = d_q \le p_q$. Thus, we have $u_q \le p_q$. Since we
cannot guarantee that $u_q = p_q$, we assume pessimistically that quantum $q$ is deductible.
Efficient-and-deprived: As before, on an efficient quantum $q$, we have $u_q = a_q$. On a deprived
quantum, we have by definition that $a_q < d_q$, and since $a_q = \min\{p_q, d_q\}$, we have $a_q = p_q$. Thus,
we have $u_q = a_q = p_q$, and quantum $q$ is accounted.
Time Analysis
We prove the completion time bound of A-GREEDY by bounding the number of deductible and
accounted quanta separately. We use a potential function argument to prove that the number of
deductible quanta is at most $2T_\infty + \log_\rho P + 1$. We then show that the number of accounted quanta
is at most $T_1/P_A$, where $T_1$ is the total work and $P_A$ is the mean availability on accounted
quanta. Thus, the total time to complete the job is at most $T_1/P_A + 2T_\infty + \log_\rho P + 1$, which is
the sum of the number of accounted and deductible quanta, since each quantum consists of a single
time step. Finally, we show that $P_A \ge \widetilde{P}$, where $\widetilde{P}$ is the
$(2T_\infty + \log_\rho P + 1)$-trimmed availability, which yields the desired result.
Our analysis uses a characterization of greedy scheduling based on whether the job uses all its
allotted processors on a given step. We define a step to be complete if the job uses all the allotted
processors in the step and incomplete if the job does not use all the available processors. In the
special case of A-GREEDY with unit quanta, an inefficient quantum consists of a single incomplete
step and an efficient quantum consists of a single complete step. The following lemma from the
literature [Blu95, BL99, EZL89] shows that whenever a greedy scheduler (including A-GREEDY)
schedules an incomplete step, the job makes progress on its critical path.
Lemma 2.1 Any greedy scheduler reduces the length of a job's remaining critical path by 1 after
every incomplete step. □
We next bound the maximum desire during the course of the computation.
Lemma 2.2 Suppose that A-GREEDY schedules a job on a machine with $P$ processors. If $\rho$ is
A-GREEDY's responsiveness parameter, then for every quantum $q$, the job's desire satisfies $d_q \le \rho P$.
PROOF. We use induction on the number of quanta. The base case $d_1 = 1$ holds trivially. If a
given quantum $q-1$ was inefficient, the desire $d_q$ decreases, and thus $d_q < d_{q-1} \le \rho P$ by induction.
If quantum $q-1$ was efficient-and-satisfied, then $d_q = \rho d_{q-1} = \rho a_{q-1} \le \rho P$. If quantum $q-1$
was efficient-and-deprived, then $d_q = d_{q-1} \le \rho P$ by induction. □
The deductible quanta for A-GREEDY are either inefficient or efficient-and-satisfied. The next
lemma bounds their number.
Lemma 2.3 Suppose that A-GREEDY schedules a job with critical-path length $T_\infty$ on a machine
with $P$ processors. If $\rho$ is A-GREEDY's responsiveness parameter, $\delta = 1$ is its utilization
parameter, and $L = 1$ is the quantum length, then the schedule produces at most
$2T_\infty + \log_\rho P + 1$ deductible quanta.
PROOF. We use a potential-function argument based on the job's desire $d_q$ before quantum $q$.
Define the potential before quantum $q$ to be
\[ \Phi(q) = 2T_\infty^q - \log_\rho d_q , \]
where $T_\infty^q$ denotes the length of the remaining critical path before quantum $q$ is executed, that is,
the length of the longest path in the unexecuted dag. The initial potential is
\begin{align*}
\Phi(1) &= 2T_\infty - \log_\rho d_1 \\
        &= 2T_\infty ,
\end{align*}
since the desire in the first quantum is $d_1 = 1$. If the job executes for $Q$ quanta, the final potential
is
\begin{align*}
\Phi(Q+1) &= 2T_\infty^{Q+1} - \log_\rho d_{Q+1} \\
          &\ge 0 - \log_\rho(\rho P) \\
          &= -\log_\rho P - 1 ,
\end{align*}
by Lemma 2.2. Since the potential starts at $2T_\infty$ and is at least $-\log_\rho P - 1$ at the end of the
computation, the total decrease of the potential is $\Phi(1) - \Phi(Q+1) \le 2T_\infty + \log_\rho P + 1$.
We now compute the decrease in potential during each quantum based on the three-way
classification. Each case will use the fact that the decrease in potential during any quantum $q$ is
\begin{align*}
\Delta\Phi &= \Phi(q) - \Phi(q+1) \\
           &= (2T_\infty^q - \log_\rho d_q) - (2T_\infty^{q+1} - \log_\rho d_{q+1}) \\
           &= 2(T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) .
\end{align*}
Inefficient: An inefficient quantum $q$ consists of a single incomplete step. After an incomplete
step, the length of the remaining critical path reduces by 1 (Lemma 2.1). Moreover, we have
$d_{q+1} = d_q/\rho$, since A-GREEDY reduces the desire after an inefficient quantum. Thus, the decrease
in potential after an inefficient quantum is
\begin{align*}
\Delta\Phi &= 2(T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) \\
           &= 2(T_\infty^q - (T_\infty^q - 1)) - (\log_\rho d_q - \log_\rho(d_q/\rho)) \\
           &= 2(1) - (1) \\
           &= 1 .
\end{align*}
Efficient-and-satisfied: A-GREEDY increases the desire after every efficient-and-satisfied
quantum ($d_{q+1} = \rho d_q$). The remaining critical-path length never increases. Thus, the decrease in
potential is
\begin{align*}
\Delta\Phi &= 2(T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) \\
           &\ge 0 - (\log_\rho d_q - \log_\rho(\rho d_q)) \\
           &= 1 .
\end{align*}
Efficient-and-deprived: After efficient-and-deprived quanta, A-GREEDY maintains the previous
desire ($d_{q+1} = d_q$), and, as before, the critical-path length never increases. Thus, the decrease in
potential is
\begin{align*}
\Delta\Phi &= 2(T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) \\
           &\ge 0 .
\end{align*}
Thus, the potential never increases, and it decreases by at least 1 after every deductible quantum.
Thus, the number of deductible quanta is at most $2T_\infty + \log_\rho P + 1$, the total decrease in potential.
□
We now bound the number of accounted quanta.
Lemma 2.4 Suppose that A-GREEDY schedules a job with work $T_1$. If $\delta = 1$ is A-GREEDY's
utilization parameter and $L = 1$ is the quantum length, then the schedule produces at most $T_1/P_A$
accounted quanta, where $P_A$ is the mean availability on accounted quanta.
PROOF. Let $A$ be the set of accounted quanta, and $D$ be the set of deductible quanta. The mean
availability on accounted quanta is $P_A = (1/|A|) \sum_{q \in A} p_q$. The total number of tasks executed
over the course of the computation is $T_1 = \sum_{q \in A \cup D} u_q$, since each of the $T_1$ tasks is executed
exactly once in either an accounted or a deductible quantum. Since accounted quanta are those for
which $u_q = p_q$, we have
\begin{align*}
T_1 &= \sum_{q \in A \cup D} u_q \\
    &\ge \sum_{q \in A} u_q \\
    &= \sum_{q \in A} p_q \\
    &= |A| \, P_A .
\end{align*}
Thus, the number of accounted quanta is $|A| \le T_1/P_A$. □
We can now bound the completion time of a job scheduled by A-GREEDY with unit quanta.
Theorem 2.5 Suppose that A-GREEDY schedules a job with work $T_1$ and critical-path length $T_\infty$
on a machine with $P$ processors. If $\rho$ is A-GREEDY's responsiveness parameter, $\delta = 1$ is its
utilization parameter, and $L = 1$ is the quantum length, then A-GREEDY completes the job in
\[ T \le T_1/\widetilde{P} + 2T_\infty + \log_\rho P + 1 \]
time steps, where $\widetilde{P}$ is the $(2T_\infty + \log_\rho P + 1)$-trimmed availability.
PROOF. The proof is a trim analysis. Let $A$ be the set of accounted quanta, and $D$ be the set of
deductible quanta. Lemma 2.3 shows that there are $|D| \le 2T_\infty + \log_\rho P + 1$ deductible time steps,
since each quantum consists of a single time step. We have $P_A \ge \widetilde{P}$, since the mean availability on
the accounted time steps (we trim the $|D|$ deductible steps) must be at least the
$(2T_\infty + \log_\rho P + 1)$-trimmed availability (we trim the $2T_\infty + \log_\rho P + 1$ steps that have the
highest availability). From Lemma 2.4, the number of accounted quanta is
$|A| \le T_1/P_A \le T_1/\widetilde{P}$, and since $T = L(|A| + |D|)$ with $L = 1$,
the desired time bound follows. □
Waste Analysis
We now prove the waste bound for A-GREEDY with unit quanta. Let $w_q = a_q - u_q$ be the waste
of quantum $q$. In efficient quanta, the usage is $u_q = a_q$, and the waste is $w_q = 0$. Therefore, the
job wastes processor cycles only on inefficient quanta. The next theorem shows that the waste on
inefficient quanta can be amortized against the work done on efficient quanta.
Theorem 2.6 Suppose that A-GREEDY schedules a job with work $T_1$ on a machine. If $\rho$ is
A-GREEDY's responsiveness parameter, $\delta = 1$ is its utilization parameter, and $L = 1$ is the quantum
length, then A-GREEDY wastes at most $\rho T_1$ processor cycles in the course of its computation.
PROOF. We use a potential-function argument based on the job's desire $d_q$ before quantum $q$.
Define the potential $\Psi(q)$ before quantum $q$ as
\[ \Psi(q) = \rho T_1^q + \frac{\rho}{\rho - 1} d_q , \]
where $T_1^q$ is the total number of unexecuted tasks in the computation before quantum $q$. Thus, the
initial potential is
\begin{align*}
\Psi(1) &= \rho T_1 + \frac{\rho}{\rho - 1} d_1 \\
        &= \rho T_1 + \rho/(\rho - 1) ,
\end{align*}
since $d_1 = 1$. If the job executes for $Q$ quanta, the final potential is
\begin{align*}
\Psi(Q+1) &= \rho T_1^{Q+1} + \frac{\rho}{\rho - 1} d_{Q+1} \\
          &\ge \rho/(\rho - 1) ,
\end{align*}
since the desire $d_q$ of any quantum $q$ is at least 1. Therefore the total decrease in potential is
$\Psi(1) - \Psi(Q+1) \le \rho T_1$.
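Based on the three-way classification, we shall show that if the waste on quantum $q$ is
$w_q = a_q - u_q$, then the potential decreases by at least $w_q$ during the quantum. Each case will use
the fact that the decrease in potential during any quantum $q$ is
\begin{align*}
\Delta\Psi &= \Psi(q) - \Psi(q+1) \\
           &= \left(\rho T_1^q + \frac{\rho}{\rho - 1} d_q\right) - \left(\rho T_1^{q+1} + \frac{\rho}{\rho - 1} d_{q+1}\right) \\
           &= \rho (T_1^q - T_1^{q+1}) + \frac{\rho}{\rho - 1} (d_q - d_{q+1}) .
\end{align*}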
Inefficient: For any quantum $q$, $w_q < a_q$, which is to say, the number of processor cycles wasted
is less than the total number of processor cycles allotted. Since the allotment is $a_q \le d_q$, we have
$w_q < d_q$. After an inefficient quantum $q$, A-GREEDY reduces the desire to be $d_{q+1} = d_q/\rho$. Thus,
the decrease in potential is
\begin{align*}
\Delta\Psi &= \rho (T_1^q - T_1^{q+1}) + \frac{\rho}{\rho - 1} (d_q - d_{q+1}) \\
           &\ge \frac{\rho}{\rho - 1} (d_q - d_q/\rho) \\
           &= d_q \\
           &> w_q .
\end{align*}
Efficient-and-satisfied: Since no processor cycles are wasted on any efficient quantum $q$, we
have $w_q = 0$ and the remaining work reduces by $u_q = a_q$. On an efficient-and-satisfied quantum $q$,
the allotment is the same as the desire ($a_q = d_q$) and A-GREEDY increases the desire ($d_{q+1} = \rho d_q$)
after the quantum. Thus, the decrease in potential is
\begin{align*}
\Delta\Psi &= \rho (T_1^q - T_1^{q+1}) + \frac{\rho}{\rho - 1} (d_q - d_{q+1}) \\
           &= \rho a_q + \frac{\rho}{\rho - 1} (d_q - \rho d_q) \\
           &= \rho d_q - \rho d_q \\
           &= 0 \\
           &= w_q .
\end{align*}
Efficient-and-deprived: On any efficient quantum $q$, we have $w_q = 0$ and the amount of remaining
work reduces by $u_q = a_q$. Since the quantum $q$ is efficient-and-deprived, we have $d_{q+1} = d_q$,
because A-GREEDY maintains the previous desire. Therefore, the decrease in potential is
\begin{align*}
\Delta\Psi &= \rho (T_1^q - T_1^{q+1}) + \frac{\rho}{\rho - 1} (d_q - d_{q+1}) \\
           &= \rho a_q + 0 \\
           &\ge 0 \\
           &= w_q .
\end{align*}
In all three cases, if the job wastes $w_q$ processor cycles in quantum $q$, the potential decreases by at
least $w_q$. Consequently, the total waste during the course of the computation is at most $\rho T_1$, the
total decrease in potential. □
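The unit-quantum results can be exercised on a toy job. The sketch below makes several assumptions that are not part of the analysis: the job is a layered dag in which layer $i$ holds `widths[i]` unit tasks and becomes ready only when layer $i-1$ has finished (so $T_\infty$ is the number of layers and $T_1$ is the sum of the widths), $\rho$ is an integer, and the job scheduler is modeled by a seeded random availability between 1 and $P$. All names are hypothetical. Deductible quanta are counted as in Lemma 2.3, that is, inefficient or efficient-and-satisfied.

```python
import math
import random
from typing import List, Tuple


def simulate_a_greedy_unit(widths: List[int], P: int, rho: int,
                           seed: int = 0) -> Tuple[int, int, int]:
    """Run A-GREEDY with L = 1 and delta = 1 on a layered job.

    Returns (time steps, deductible quanta, wasted processor cycles).
    """
    if not widths or min(widths) < 1:
        raise ValueError("every layer needs at least one task")
    rng = random.Random(seed)
    desire = 1                               # d_1 = 1
    layer, left = 0, widths[0]
    steps = deductible = waste = 0
    while layer < len(widths):
        avail = rng.randint(1, P)            # assumed job scheduler: arbitrary availability
        allot = min(desire, avail)           # a_q = min{p_q, d_q}
        used = min(allot, left)              # greedy step on the current layer
        left -= used
        if left == 0:                        # layer finished: critical path shrinks by 1
            layer += 1
            left = widths[layer] if layer < len(widths) else 0
        steps += 1
        waste += allot - used
        if used < allot:                     # inefficient: divide the desire by rho
            deductible += 1
            desire //= rho
        elif allot == desire:                # efficient and satisfied: multiply by rho
            deductible += 1
            desire *= rho
        # efficient and deprived: desire unchanged, quantum is accounted
    return steps, deductible, waste


if __name__ == "__main__":
    widths = [1, 8, 64, 8, 1, 32, 2]
    P, rho = 16, 2
    steps, deductible, waste = simulate_a_greedy_unit(widths, P, rho)
    T1, Tinf = sum(widths), len(widths)
    print("deductible:", deductible, "bound:", 2 * Tinf + math.log(P, rho) + 1)
    print("waste:", waste, "bound:", rho * T1)
```

In this model an inefficient step always finishes the current layer, which is the layered-dag instance of Lemma 2.1.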
2.6 Trim Analysis of A-GREEDY for the General Case
We now use a trim analysis to analyze the general case of A-GREEDY when each scheduling
quantum has $L$ time steps, $\delta$ is the utilization parameter, $\rho$ is the responsiveness parameter, and $P$
is the number of processors in the machine. For a job with work $T_1$ and critical-path length $T_\infty$,
A-GREEDY achieves the following bounds on running time and waste, where $\widetilde{P}$ is the
$(2T_\infty/(1-\delta) + L \log_\rho P + L)$-trimmed availability:
\begin{align*}
T &\le \frac{T_1}{\delta \widetilde{P}} + \frac{2T_\infty}{1-\delta} + L \log_\rho P + L , \\
W &\le \frac{1+\rho-\delta}{\delta} \, T_1 .
\end{align*}
As in Section 2.5, we label each quantum as either accounted or deductible. Recall that a
quantum $q$ of length $L$ and processor availability $p_q$ has a total of $L p_q$ processor cycles available.
Accounted quanta are those for which $u_q \ge \delta L p_q$, that is, the job uses at least a $\delta$ fraction of all
available processor cycles. The deductible quanta are those for which $u_q < \delta L p_q$. By the same
logic as in Section 2.5, inefficient quanta or efficient-and-satisfied quanta are labeled deductible.
Efficient-and-deprived quanta, on the other hand, are labeled accounted.
Time Analysis
We bound the accounted and deductible quanta separately. We first show how inefficient quanta
affect the remaining critical path of the job.
Lemma 2.7 A-GREEDY reduces the length of a job's remaining critical path by at least $(1-\delta)L$
after every inefficient quantum, where $\delta$ is A-GREEDY's utilization parameter and $L$ is the quantum
length.
PROOF. The total number of tasks completed in an inefficient quantum $q$ is less than $\delta L a_q$.
Therefore, there can be at most $\delta L$ complete steps in an inefficient quantum, since on a complete step, the
job uses all the allotted processors, completing $a_q$ tasks. Since there are $L$ time steps in a quantum,
there are at least $(1-\delta)L$ incomplete steps. Thus, the critical path reduces by at least $(1-\delta)L$,
since Lemma 2.1 shows that every incomplete step reduces the critical path by 1. □
The next lemma bounds the number of deductible quanta.
Lemma 2.8 Suppose that A-GREEDY schedules a job with critical-path length $T_\infty$ on a machine
with $P$ processors. If $\rho$ is A-GREEDY's responsiveness parameter, $\delta$ is its utilization parameter,
and $L$ is the quantum length, then the schedule produces at most $2T_\infty/((1-\delta)L) + \log_\rho P + 1$
deductible quanta.
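PROOF. We use a potential-function argument as in Lemma 2.3. Define the potential before
quantum $q$ as
\[ \Phi(q) = \frac{2T_\infty^q}{(1-\delta)L} - \log_\rho d_q , \]
where $T_\infty^q$ is the remaining critical path before quantum $q$. If the job completes in $Q$ quanta, the
total decrease in potential is
\begin{align*}
\Phi(1) - \Phi(Q+1) &= \frac{2T_\infty - 2T_\infty^{Q+1}}{(1-\delta)L} - (\log_\rho d_1 - \log_\rho d_{Q+1}) \\
                    &\le \frac{2T_\infty}{(1-\delta)L} + \log_\rho P + 1 ,
\end{align*}
since $d_1 = 1$ and $d_{Q+1} \le \rho P$ by Lemma 2.2.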
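We can compute the decrease in potential during each quantum based on the three-way
classification. The decrease in potential in quantum $q$ is
\begin{align*}
\Delta\Phi &= \Phi(q) - \Phi(q+1) \\
           &= \left(\frac{2T_\infty^q}{(1-\delta)L} - \log_\rho d_q\right) - \left(\frac{2T_\infty^{q+1}}{(1-\delta)L} - \log_\rho d_{q+1}\right) \\
           &= \frac{2}{(1-\delta)L} (T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) .
\end{align*}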
Inefficient: By Lemma 2.7, the critical path reduces by at least $(1-\delta)L$ during an inefficient
quantum. Moreover, we have $d_{q+1} = d_q/\rho$, since A-GREEDY reduces the desire after an inefficient
quantum. Thus, the decrease in potential after an inefficient quantum is
\begin{align*}
\Delta\Phi &= \frac{2}{(1-\delta)L} (T_\infty^q - T_\infty^{q+1}) - (\log_\rho d_q - \log_\rho d_{q+1}) \\
           &\ge \frac{2}{(1-\delta)L} \bigl(T_\infty^q - (T_\infty^q - (1-\delta)L)\bigr) - (\log_\rho d_q - \log_\rho(d_q/\rho)) \\
           &= 2(1) - (1) \\
           &= 1 .
\end{align*}
Efficient-and-satisfied: A-GREEDY increases the desire after every efficient-and-satisfied
quantum ($d_{q+1} = \rho d_q$). The remaining critical-path length never increases. Thus, the decrease in
potential is at least 1.
Efficient-and-deprived: After efficient-and-deprived quanta, A-GREEDY maintains the previous
desire ($d_{q+1} = d_q$), and, as before, the critical-path length never increases. Thus, the potential does
not increase.
Thus, the potential never increases, and it decreases by at least 1 after every deductible quantum.
Thus, the number of deductible quanta is at most $2T_\infty/((1-\delta)L) + \log_\rho P + 1$, the total decrease
in potential. □
We now bound the number of accounted quanta.
Lemma 2.9 Suppose that A-GREEDY schedules a job with work $T_1$. If $\delta$ is A-GREEDY's
utilization parameter and $L$ is the quantum length, then the schedule produces at most $T_1/(\delta L P_A)$
accounted quanta, where $P_A$ is the mean availability on accounted quanta.
PROOF. Let $A$ be the set of accounted quanta, and let $D$ be the set of deductible quanta. The mean
availability on accounted quanta is $P_A = (1/|A|) \sum_{q \in A} p_q$. The total number of tasks executed in
the course of the computation is $T_1 = \sum_{q \in A \cup D} u_q$. Since the accounted quanta are those for which
$u_q \ge \delta L p_q$, we have
\begin{align*}
T_1 &= \sum_{q \in A \cup D} u_q \\
    &\ge \sum_{q \in A} u_q \\
    &\ge \sum_{q \in A} \delta L p_q \\
    &= \delta L \, |A| \, P_A .
\end{align*}
Therefore, the total number of accounted quanta is at most
$|A| \le T_1/(\delta L P_A)$. □
The next theorem provides the time bound for A-GREEDY.
Theorem 2.10 Suppose that A-GREEDY schedules a job with work $T_1$ and critical-path length $T_\infty$
on a machine with $P$ processors. If $\rho$ is A-GREEDY's responsiveness parameter, $\delta$ is its utilization
parameter, and $L$ is the quantum length, then A-GREEDY completes the job in
\[ T \le T_1/(\delta \widetilde{P}) + 2T_\infty/(1-\delta) + L \log_\rho P + L \]
time steps, where $\widetilde{P}$ is the $(2T_\infty/(1-\delta) + L \log_\rho P + L)$-trimmed availability.
PROOF. The proof is a trim analysis. Let $A$ be the set of accounted quanta, and $D$ be the set of
deductible quanta. Lemma 2.8 shows that there are $|D| \le 2T_\infty/((1-\delta)L) + \log_\rho P + 1$ deductible
quanta, and hence at most $L|D| \le 2T_\infty/(1-\delta) + L \log_\rho P + L$ time steps belong to deductible
quanta. We have that $P_A \ge \widetilde{P}$, since the mean availability on the accounted time steps (we trim the
$L|D|$ deductible steps) must be at least the $(2T_\infty/(1-\delta) + L \log_\rho P + L)$-trimmed availability (we
trim the $2T_\infty/(1-\delta) + L \log_\rho P + L$ steps that have the highest availability). From Lemma 2.9, the
number of accounted quanta is $|A| \le T_1/(\delta L P_A) \le T_1/(\delta L \widetilde{P})$, and since
$T = L(|A| + |D|)$, the desired time bound follows. □
Waste Analysis
We now prove the waste bound for A-GREEDY.
Theorem 2.11 Suppose that A-GREEDY schedules a job with work $T_1$ on a machine. If $\rho$ is
A-GREEDY's responsiveness parameter, $\delta$ is its utilization parameter, and $L$ is the quantum length,
then A-GREEDY wastes at most $(1+\rho-\delta)T_1/\delta$ processor cycles in the course of its computation.
PROOF. We can prove the bound using a potential-function argument similar to the one presented
in Theorem 2.6. In this case the potential before quantum $q$ is defined as
\[ \Psi(q) = \frac{1+\rho-\delta}{\delta} \, T_1^q + \frac{\rho}{\rho - 1} \, L d_q , \]
where $T_1^q$ is the total number of unexecuted tasks in the computation before quantum $q$ and $d_q$ is
the desire for quantum $q$.
Instead, here we use an accounting argument for better intuition about A-GREEDY's waste. We
amortize the waste in different quanta according to whether they are efficient or inefficient.
Inefficient: Based on Lemma 2.19, every inefficient quantum $q$ maps to a unique
efficient-and-satisfied quantum $r$ such that $d_r = d_q/\rho$. Therefore, the waste on the inefficient quantum $q$ can
be amortized against the work done on efficient-and-satisfied quantum $r$. During the inefficient
quantum $q$, the waste is $w_q < L a_q \le L d_q$. The work done on the corresponding
efficient-and-satisfied quantum is $u_r \ge \delta L a_r = \delta L d_r = \delta L d_q/\rho$. Thus, the waste (less than $L d_q$) during
the inefficient quantum $q$ is at most $\rho/\delta$ times the work (at least $\delta L d_q/\rho$) on its corresponding
efficient-and-satisfied quantum $r$. Thus, the total waste over all inefficient quanta is at most $\rho T_1/\delta$,
which is at most $\rho/\delta$ times the work (at most $T_1$) on efficient-and-satisfied quanta.
Efficient: The job completes at least $\delta L a_q$ work on an efficient quantum with allotment $a_q$, and
wastes at most $(1-\delta) L a_q$ processor cycles. Therefore the total waste on efficient quanta is at most
$((1-\delta)/\delta) T_1$, which is $(1-\delta)/\delta$ times the work done in efficient quanta (at most $T_1$).
Since the total waste is the sum of the waste on efficient and inefficient quanta, it is at most
$((1+\rho-\delta)/\delta) T_1$. □
We can decompose the bounds of Theorems 2.10 and 2.11 into separate bounds for accounted
and deductible quanta.
Corollary 2.12 Suppose that A-GREEDY schedules a job with work $T_1$ and critical-path length $T_\infty$
on a machine with $P$ processors, and suppose that $\rho$ is A-GREEDY's responsiveness parameter, $\delta$
is its utilization parameter, and $L$ is the quantum length. Let $T_a$ and $T_d$ be the number of time steps
in accounted and deductible quanta, respectively, and let $W_a$ and $W_d$ be the waste on accounted
and deductible quanta, respectively. Then, A-GREEDY achieves the following bounds:
\begin{align*}
T_a &\le (1/\delta) \, T_1/\widetilde{P} , \\
T_d &\le (2 \min\{L, 1/(1-\delta)\}) \, T_\infty + L \log_\rho P + L , \\
W_a &\le (1/\delta - 1) \, T_1 , \\
W_d &\le (\rho/\delta) \, T_1 .
\end{align*}
As can be seen from these inequalities, the bounds for accounted quanta are stronger than
those for deductible quanta. The reason is that the job scheduler in our model is adversarial. In
practice, however, it seems unlikely that the job scheduler would actually act as an adversary. Thus,
A-GREEDY's behavior on the deductible quanta is likely to be much better than these worst-case
bounds predict. Moreover, since the adversary's bad behavior is limited to relatively few deductible
quanta, we conjecture that in practice the overall time and waste of a real scheduler based on A-
GREEDY more closely follows the bounds for accounted quanta.
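To get a feel for the magnitudes involved, the bounds of Theorems 2.10 and 2.11 can be evaluated numerically. The helper below is a hypothetical convenience function; the parameter values in the usage note are made up for illustration and are not measurements.

```python
import math
from typing import Tuple


def a_greedy_bounds(T1: float, Tinf: float, P: int, P_trim: float,
                    L: int, delta: float, rho: float) -> Tuple[float, float]:
    """Evaluate the time bound of Theorem 2.10 and the waste bound of Theorem 2.11.

    P_trim stands for the trimmed availability and must be supplied by the caller.
    """
    if not 0 < delta < 1 or rho <= 1:
        raise ValueError("expected 0 < delta < 1 and rho > 1")
    time_bound = (T1 / (delta * P_trim)
                  + 2 * Tinf / (1 - delta)
                  + L * math.log(P, rho)
                  + L)
    waste_bound = (1 + rho - delta) / delta * T1
    return time_bound, waste_bound
```

For example, with $T_1 = 10^6$, $T_\infty = 10^3$, $P = 128$, a trimmed availability of 64, $L = 1000$, $\delta = 0.8$, and $\rho = 2$, the time bound is about 37,531 steps, of which about 19,531 come from the $T_1/(\delta\widetilde{P})$ term, and the waste bound is $2.75\,T_1$.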
2.7 Trim Analysis of A-STEAL
This section uses a trim analysis to analyze A-STEAL with respect to both time and waste. Suppose
that A-STEAL schedules a job with work $T_1$ and span $T_\infty$ on a machine with $P$ processors. Let $\rho$
denote A-STEAL's responsiveness parameter, $\delta$ the utilization parameter, and $L$ the quantum length.
We will show that A-STEAL completes the job in time
$T = O(T_1/\widetilde{P} + T_\infty + L \lg P + L \ln(1/\epsilon))$
with probability at least $1 - \epsilon$, where $\widetilde{P}$ denotes the
$O(T_\infty + L \lg P + L \ln(1/\epsilon))$-trimmed availability.
This bound implies that A-STEAL achieves linear speedup on all the time steps excluding
at most $O(T_\infty + L \lg P + L \ln(1/\epsilon))$ time steps with highest processor availability. Moreover,
A-STEAL guarantees that the total number of processor cycles wasted during the job's execution is
$W = O(T_1)$.
We prove these bounds using a trim analysis. We label each quantum as either accounted or
deductible. Accounted quanta are those with $n_q \ge \delta L p_q$, where $n_q$ denotes the nonsteal usage. That
is, the job works or mugs for at least a $\delta$ fraction of the $L p_q$ processor cycles possibly available
during the quantum. Conversely, the deductible quanta are those where $n_q < \delta L p_q$. Our trim analysis
will show that when we ignore the relatively few deductible quanta, we obtain linear speedup on
the more numerous accounted quanta. We can relate this labeling to a three-way classification of
quanta as inefficient, efficient-and-satisfied, and efficient-and-deprived:
* Inefficient: In an inefficient quantum $q$, we have $n_q < \delta L a_q \le \delta L p_q$, since the allotment
$a_q$ never exceeds the availability $p_q$. Thus, we label all inefficient quanta as deductible,
irrespective of whether they are satisfied or deprived.
* Efficient-and-satisfied: On an efficient quantum $q$, we have $n_q \ge \delta L a_q$. Since we have
$a_q = \min\{p_q, d_q\}$, for a satisfied quantum it follows that $a_q = d_q \le p_q$. Despite these two
bounds, we may nevertheless have $n_q < \delta L p_q$. Since we cannot guarantee that $n_q \ge \delta L p_q$,
we pessimistically label the quantum $q$ as deductible.
* Efficient-and-deprived: As before, on an efficient quantum $q$, we have $n_q \ge \delta L a_q$. On a
deprived quantum, we have $a_q < d_q$ by definition. Since $a_q = \min\{p_q, d_q\}$, we must have
$a_q = p_q$. Hence, it follows that $n_q \ge \delta L a_q = \delta L p_q$, and we label quantum $q$ as accounted.
Time analysis
We now analyze the execution time of A-STEAL by separately bounding the number of deductible
and accounted quanta. Two observations provide intuition for the proof. First, each inefficient
quantum contains a large number of steal-cycles, which we can expect to reduce the length of the
remaining span. This observation will help us to bound the number of deductible quanta. Second,
most of the processor cycles in an efficient quantum are spent either working or mugging. We will
show that there cannot be too many mug-cycles during the job's execution, and thus most of the
processor cycles on efficient quanta are spent doing useful work. This observation will help us to
bound the number of accounted quanta.
The following lemma, proved in Lemma 11 of [BL99], shows how steal-cycles reduce the
length of the job's span.
Lemma 2.13 If a job has $r$ deques of ready threads, then $3r$ steal-cycles suffice to reduce the length
of the job's remaining span by at least 1 with probability at least $1 - 1/e$, where $e$ is the base of the
natural logarithm. □
The next lemma shows that an inefficient quantum reduces the length of the job's span, which
we will later use to bound the total number of inefficient quanta.
Lemma 2.14 Let $\delta$ denote A-STEAL's utilization parameter, and $L$ the quantum length. With
probability greater than 1/4, A-STEAL reduces the length of a job's remaining span in an inefficient
quantum by at least $(1-\delta)L/6$.
PROOF. Let $q$ be an inefficient quantum. A processor with an empty deque steals only when
it cannot mug a deque, and hence, all the steal-cycles in quantum $q$ occur when the number of
nonempty deques is at most the allotment $a_q$. Therefore, by Lemma 2.13, $3a_q$ steal-cycles suffice
to reduce the span by 1 with probability at least $1 - 1/e$. Since the quantum $q$ is inefficient, it
contains at least $(1-\delta) L a_q$ steal-cycles. Divide the time steps of the quantum into rounds such
that each round contains $3a_q$ steal-cycles, except for possibly the last. Thus, there are at least
$m = (1-\delta) L a_q/(3a_q) = (1-\delta)L/3$ rounds.5 We call a round good if it reduces the length of
the span by at least 1; otherwise, the round is bad. For each round $i$ in quantum $q$, we define
the indicator random variable $X_i$ to be 1 if round $i$ is a bad round and 0 otherwise, and let
$X = \sum_{i=1}^{m} X_i$. Since we have $\Pr\{X_i = 1\} < 1/e$, linearity of expectation dictates that $E[X] < m/e$.
We now apply Markov's inequality [CLRS01, p. 1111], which says that for a nonnegative random
variable $X$, we have $\Pr\{X \ge t\} \le E[X]/t$ for all $t > 0$. Substituting $t = m/2$, we obtain
$\Pr\{X > m/2\} \le E[X]/(m/2) \le (m/e)/(m/2) = 2/e < 3/4$. Thus, the probability exceeds
1/4 that quantum $q$ contains at least $m/2$ good rounds. Since each good round reduces the span
by at least 1, with probability greater than 1/4, the span is reduced during quantum $q$ by at least
$m/2 = ((1-\delta)L/3)/2 = (1-\delta)L/6$. □
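The arithmetic in this proof is easy to check numerically. The sketch below is only a calculator for the quantities named in the proof, ignoring roundoff as the proof does; the function name and return layout are hypothetical.

```python
import math
from typing import Tuple


def inefficient_quantum_guarantee(L: int, delta: float) -> Tuple[float, float, float]:
    """Return (rounds m, Markov bound on Pr{X > m/2}, guaranteed span reduction m/2)."""
    if not 0 < delta < 1 or L <= 0:
        raise ValueError("expected 0 < delta < 1 and L > 0")
    m = (1 - delta) * L / 3           # rounds of 3*a_q steal-cycles each
    expected_bad = m / math.e         # upper bound on E[X], the number of bad rounds
    markov = expected_bad / (m / 2)   # E[X]/(m/2) = 2/e, independent of m
    return m, markov, m / 2
```

Since $2/e \approx 0.736$ is below 3/4, more than a quarter of the probability mass remains for the event that at least half of the rounds are good.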
Lemma 2.15 Suppose that A-STEAL schedules a job with span $T_\infty$ on a machine. Let $\rho$ denote
A-STEAL's responsiveness parameter, $\delta$ the utilization parameter, and $L$ the quantum length. Then,
for any $\epsilon > 0$, with probability at least $1 - \epsilon$, the schedule produces at most
$48T_\infty/((1-\delta)L) + 16 \ln(1/\epsilon)$ inefficient quanta.
PROOF. Let I be the set of inefficient quanta. Define an inefficient quantum q as productive if it
reduces the span by at least (1 - 6)L/6, and unproductive otherwise. For each quantum q E I,
define the indicator random variable Y to be 1 if q is productive and 0 otherwise. By Lemma 2.14,
we have Pr {Y = 1} > 1/4. Let the total number of productive quanta be Y = q, Y. For
simplicity in notation, let A = 6T/((1 - 6)L). If the job's execution contains II > 48To/((1 -
6)L)+16 In(1/c) inefficient quanta, then we have E [Y] > II /4 > 12To/((1-6)L) +4 In(1/) =
2A + 41n(1/e). Using the Chernoff bound Pr {Y < (1 - A)E[Y]} < exp(-A2 E[Y]/2) [MR95,
5Actually, the number of rounds is m = [(1 - ,,)L/3J, but we will ignore the roundoff for simplicity. A more
detailed analysis can nevertheless produce the same: constants in the bounds for Lemmas 2.15 and 2.18.
i
p. 70] and choosing A = (A + 4 In(1/E)) / (2A + 4 ln(1/E)), we obtain
Pr{Y < A}
= Pr Y< 1- A+4n(1/)(2A + 41n(l/))
2A + 41n(l/c)
= Pr{Y < (1 - A) (2A 41ln(1/E))}
< exp - (2A + 4 In(1/t))
exp 1 (A + 4 n(/1 ))
2
2 2A + 4 In(1]/)
< exp -2 4 1n(1/) -
Therefore, if the number of inefficient quanta is I 1 = 48T,/((1 - 6)L) + 16 ln(1/.), the number
of productive quanta is at least A = 6T,/((1 - 6) L) with probability at least 1 - C. By Lemma 2.14
each productive quantum reduces the span by at least (1 - 6)L/6, the total span reduced is no less
than T, with probability at least 1 - c when III .-:. 48T /((1 - 6)L) + 16 ln(1/ ). In order words,
with probability at least 1 - c, the job encounters no more than 48T, /((1 - 6)L) + 16 In(1/e) inef-
ficient quanta before it completes. Therefore, the number of inefficient quanta is I < 48T_/((1 -
6)L) + 16 In(1/c) with probability at least 1 - c. O
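The algebra in the Chernoff step can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `chernoff_check` and the sample values for the span, L, δ, and ε are illustrative assumptions. It confirms that the chosen λ makes the threshold (1 - λ)(2A + 4 ln(1/ε)) equal to A and that the resulting tail bound is at most ε.

```python
import math

def chernoff_check(span, L, delta, eps):
    """Evaluate the Chernoff step of Lemma 2.15 for concrete (assumed) values.

    Returns (lam, threshold, tail_bound), where threshold should equal
    A = 6*span/((1-delta)*L) and tail_bound should be at most eps.
    """
    A = 6 * span / ((1 - delta) * L)        # productive quanta needed to finish
    log_term = 4 * math.log(1 / eps)
    mean_lower = 2 * A + log_term           # lower bound on E[Y]
    lam = (A + log_term) / mean_lower       # deviation parameter chosen in the proof
    threshold = (1 - lam) * mean_lower      # algebraically equal to A
    tail_bound = math.exp(-lam ** 2 * mean_lower / 2)
    return lam, threshold, tail_bound

# Illustrative values only: span 5000, L = 200, delta = 0.8, eps = 0.01.
lam, threshold, tail_bound = chernoff_check(5000, 200, 0.8, 0.01)
print(lam, threshold, tail_bound)
```

For these inputs A = 750, and the tail bound is far below ε, which reflects the slack in the final two inequalities of the chain.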
The following technical lemma bounds the maximum value of desire.
Lemma 2.16 Suppose that A-STEAL schedules a job on a machine with P processors. Let ρ
denote A-STEAL's responsiveness parameter. Before any quantum q, the desire d_q of the job is at
most ρP.
PROOF. We use induction on the number of quanta. The base case d_1 = 1 holds trivially. If a
given quantum q - 1 was inefficient, the desire d_q decreases, and thus d_q < d_{q-1} ≤ ρP by induction.
If quantum q - 1 was efficient-and-satisfied, then d_q = ρd_{q-1} = ρa_{q-1} ≤ ρP. If quantum q - 1
was efficient-and-deprived, then d_q = d_{q-1} ≤ ρP by induction. □
The next lemma reveals a relationship between inefficient quanta and efficient-and-satisfied
quanta.
Lemma 2.17 Suppose that A-STEAL schedules a job on a machine with P processors. If ρ is
A-STEAL's responsiveness parameter and the schedule produces m inefficient quanta, then it
produces at most m + log_ρ P + 1 efficient-and-satisfied quanta.
PROOF. Assume for the purpose of contradiction that a job's execution produces k > m + log_ρ P +
1 efficient-and-satisfied quanta. Recall that the desire increases by a factor of ρ after every
efficient-and-satisfied quantum, decreases by a factor of ρ after every inefficient quantum, and does
not change otherwise. Thus, the total increase in desire is a factor of ρ^k, and the total decrease in
desire is a factor of ρ^m. Since the desire starts at 1, the desire at the end of the job is
ρ^(k-m) > ρ^(log_ρ P + 1) ≥ ρP, contradicting Lemma 2.16. □
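The desire dynamics used in Lemmas 2.16 and 2.17 can be sketched in a few lines. This is a hedged illustration rather than the thesis's implementation: the function names, the `usage_fraction` callback, and the choice of allotment min(desire, P) on a machine whose availability is always P are assumptions made for the example. The classification of quanta follows the text: a quantum is inefficient when its nonsteal usage is below Lδ times the allotment, and satisfied when the allotment equals the desire.

```python
def next_desire(desire, allotment, nonsteal_usage, L, delta, rho):
    """Update rule described in the text: divide by rho after an inefficient
    quantum, multiply by rho after an efficient-and-satisfied quantum, and
    leave the desire unchanged after an efficient-and-deprived quantum."""
    if nonsteal_usage < L * delta * allotment:   # inefficient
        return desire / rho
    if allotment == desire:                      # efficient-and-satisfied
        return desire * rho
    return desire                                # efficient-and-deprived

def simulate_desire(P, quanta, L, delta, rho, usage_fraction):
    """Hypothetical driver: availability is always P, and quantum q uses the
    fraction usage_fraction(q) of its L * allotment cycles on nonsteal work."""
    desire, history = 1.0, []
    for q in range(quanta):
        history.append(desire)
        allotment = min(desire, P)
        nonsteal_usage = usage_fraction(q) * L * allotment
        desire = next_desire(desire, allotment, nonsteal_usage, L, delta, rho)
    return history

# A job that is always efficient on an 8-processor machine with rho = 2.
print(simulate_desire(8, 8, 200, 0.8, 2.0, lambda q: 1.0))
```

In this run the desire doubles until it reaches ρP = 16 and then stays there, because every later quantum is deprived, which is the behavior Lemma 2.16 bounds.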
The following lemma bounds the number of efficient-and-satisfied quanta.
Lemma 2.18 Suppose that A-STEAL schedules a job with span T_∞ on a machine with P processors.
Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter, and L the
quantum length. Then, for any ε > 0, with probability at least 1 - ε, the schedule produces at most
48T_∞/((1 - δ)L) + log_ρ P + 16 ln(1/ε) efficient-and-satisfied quanta.
PROOF. The lemma follows directly from Lemmas 2.15 and 2.17. □
The next lemma shows that for each inefficient quantum there exists a corresponding efficient-
and-satisfied quantum.
Lemma 2.19 Suppose that A-STEAL schedules a job on a machine. Let I and C denote the set
of inefficient quanta and the set of efficient-and-satisfied quanta produced by the schedule. If ρ is
A-STEAL's responsiveness parameter, then there exists an injective mapping f : I → C such that
for all q ∈ I, we have f(q) < q and d_{f(q)} = d_q/ρ.
PROOF. For every inefficient quantum q ∈ I, define r = f(q) to be the latest efficient-and-
satisfied quantum such that r < q and d_r = d_q/ρ. Such a quantum always exists, because the initial
desire is 1 and the desire increases only after an efficient-and-satisfied quantum. We must prove
that f does not map two inefficient quanta to the same efficient-and-satisfied quantum. Assume for
the sake of contradiction that there exist two inefficient quanta q < q' such that f(q) = f(q') = r.
By definition of f, the quantum r is efficient-and-satisfied, r < q < q', and d_q = d_{q'} = ρd_r. After
the inefficient quantum q, A-STEAL reduced the desire to d_q/ρ. Since the desire later increased
again to d_{q'} = d_q and the desire increases only after efficient-and-satisfied quanta, there must be an
efficient-and-satisfied quantum r' in the range q < r' < q' such that d_{r'} = d_{q'}/ρ. But then, by
the definition of f, we would have f(q') = r'. Contradiction. □
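The construction of f can be made concrete on a sequence of quantum types. The sketch below is illustrative only: the single-letter encoding ('S' for efficient-and-satisfied, 'D' for efficient-and-deprived, 'I' for inefficient), the 0-based quantum indices, and the function name are assumptions. It tracks the desire as an exponent of ρ, so that d_r = d_q/ρ becomes an integer comparison, and it assumes that no inefficient quantum occurs while the desire is 1, in line with the existence argument in the proof.

```python
def match_inefficient(kinds):
    """Map each inefficient quantum q to the latest earlier efficient-and-
    satisfied quantum r whose desire is d_q / rho.

    kinds: sequence over {'S', 'D', 'I'}. Desires are stored as exponents e,
    meaning desire = rho ** e, with initial desire 1 (exponent 0).
    """
    exponent, exponents = 0, []
    for kind in kinds:
        exponents.append(exponent)
        if kind == 'S':
            exponent += 1
        elif kind == 'I':
            if exponent == 0:
                raise ValueError("assumed: no inefficient quantum at desire 1")
            exponent -= 1
    mapping = {}
    for q, kind in enumerate(kinds):
        if kind == 'I':
            mapping[q] = max(r for r in range(q)
                             if kinds[r] == 'S' and exponents[r] == exponents[q] - 1)
    return mapping

f = match_inefficient("SSISII")
print(f)  # each inefficient quantum gets its own satisfied partner
```

On this sequence the three inefficient quanta are matched to three different efficient-and-satisfied quanta, as the lemma requires.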
We can now bound the total number of mug-cycles executed by processors.
Lemma 2.20 Suppose that A-STEAL schedules a job with work T_1 on a machine with P processors.
Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter, and L the
quantum length. Then, the schedule produces at most ((1 + ρ)/(Lδ - 1 - ρ))T_1 mug-cycles.
PROOF. When the allotment decreases, some processors are deallocated and their deques are
declared muggable. The total number M of mug-cycles is at most the number of muggable deques
during the job's execution. Since the allotment reduces by at most a_q - 1 from quantum q to
quantum q + 1, there are M ≤ Σ_q (a_q - 1) < Σ_q a_q mug-cycles during the execution of the job.
By Lemma 2.19, for each inefficient quantum q, there is a distinct corresponding efficient-and-
satisfied quantum r = f(q) that satisfies d_q = ρd_r. By definition, each efficient-and-satisfied
quantum r has a nonsteal usage n_r ≥ Lδa_r and allotment a_r = d_r. Thus, we have n_r + n_q ≥
Lδa_r = ((Lδ)/(1 + ρ))(a_r + ρa_r) = ((Lδ)/(1 + ρ))(a_r + ρd_r) ≥ ((Lδ)/(1 + ρ))(a_r + a_q), since
a_q ≤ d_q and d_q = ρd_r. Except for these inefficient quanta and their corresponding efficient-and-
satisfied quanta, any other quantum q is efficient, and hence n_q ≥ Lδa_q for these quanta. Let
N = Σ_q n_q be the total number of nonsteal-cycles during the job's execution. We have N =
Σ_q n_q ≥ ((Lδ)/(1 + ρ)) Σ_q a_q ≥ ((Lδ)/(1 + ρ))M. Since the total number of nonsteal-cycles
is the sum of work-cycles and mug-cycles and the total number of work-cycles is T_1, we have
N = T_1 + M, and hence, T_1 = N - M ≥ ((Lδ)/(1 + ρ))M - M = ((Lδ - 1 - ρ)/(1 + ρ))M,
which yields M ≤ ((1 + ρ)/(Lδ - 1 - ρ))T_1. □
Lemma 2.21 Suppose that A-STEAL schedules a job with work T_1 on a machine with P processors.
Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter, and L the
quantum length. Then, the schedule produces at most (T_1/(LδP_A))(1 + (1 + ρ)/(Lδ - 1 - ρ))
accounted quanta, where P_A is the mean availability on accounted quanta.
PROOF. Let A and D denote the set of accounted and deductible quanta, respectively. The mean
availability on accounted quanta is P_A = (1/|A|) Σ_{q∈A} p_q. Let N be the total number of nonsteal-
cycles. By definition of accounted quanta, the nonsteal usage satisfies n_q ≥ Lδa_q. Thus, we have
N = Σ_{q∈A∪D} n_q ≥ Σ_{q∈A} n_q ≥ Σ_{q∈A} δLp_q = δL|A|P_A, and hence, we obtain
\[ |A| \le N/(L\delta P_A) . \tag{2.1} \]
The total number of nonsteal-cycles is the sum of the number of work-cycles and mug-cycles.
Since there are at most T_1 work-cycles on accounted quanta and, by Lemma 2.20, there are at most
M ≤ ((1 + ρ)/(Lδ - 1 - ρ))T_1 mug-cycles, we have N ≤ T_1 + M ≤ T_1(1 + (1 + ρ)/(Lδ - 1 - ρ)).
Substituting this bound on N into Inequality (2.1) completes the proof. □
We are now ready to bound the running time of jobs scheduled with A-STEAL.
Theorem 2.22 Suppose that A-STEAL schedules a job with work T_1 and span T_∞ on a machine
with P processors. Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter,
and L the quantum length. For any ε > 0, with probability at least 1 - ε, A-STEAL completes the
job in
\[ T \le \frac{T_1}{\delta\widetilde{P}}\left(1 + \frac{1+\rho}{L\delta - 1 - \rho}\right) + O\left(\frac{T_\infty}{1-\delta} + L\log_\rho P + L\ln(1/\varepsilon)\right) \tag{2.2} \]
time steps, where P̃ is the O(T_∞/(1 - δ) + L log_ρ P + L ln(1/ε))-trimmed availability.
PROOF. The proof is a trim analysis. Let A be the set of accounted quanta, and let D be the
set of deductible quanta. The overall number of time steps is thus at most L(|A| + |D|).
Lemmas 2.15 and 2.18 show that there are at most |D| = O(T_∞/((1 - δ)L) + log_ρ P + ln(1/ε))
deductible quanta with high probability, since efficient-and-satisfied quanta and inefficient quanta
are deductible. Hence there are at most L|D| = O(T_∞/(1 - δ) + L log_ρ P + L ln(1/ε)) time steps
in deductible quanta with high probability. We have that P_A ≥ P̃, since the mean availability on
the accounted time steps (we trim the L|D| deductible steps) must be at least the O(T_∞/(1 - δ) +
L log_ρ P + L ln(1/ε))-trimmed availability (we trim the O(T_∞/(1 - δ) + L log_ρ P + L ln(1/ε))
steps that have the highest availability). From Lemma 2.21, the number of accounted quanta is
bounded by |A| ≤ (T_1/(LδP_A))(1 + (1 + ρ)/(Lδ - 1 - ρ)). Combining the two parts, the desired
time bound follows. □
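The trimming step in this proof can be illustrated with a small helper. This is a sketch under the description given in the proof, where the R-trimmed availability is the mean availability after discarding the R time steps with the highest availability; the function name and the sample profile are assumptions made for the example.

```python
def trimmed_availability(availability, r):
    """Mean availability over the time steps that remain after the r steps
    with the highest availability have been discarded (assumed definition,
    following the description in the proof of Theorem 2.22)."""
    if not 0 <= r < len(availability):
        raise ValueError("r must leave at least one time step")
    kept = sorted(availability)[:len(availability) - r]
    return sum(kept) / len(kept)

# Hypothetical profile: mostly 4 processors, with two brief spikes to 64.
profile = [4, 4, 4, 64, 64]
print(trimmed_availability(profile, 0))  # plain mean: 28.0
print(trimmed_availability(profile, 2))  # spikes trimmed: 4.0
```

The example shows why the bound is stated in terms of trimmed rather than mean availability: a few quanta with very high availability can inflate the mean without giving the job a chance to use the processors.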
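Corollary 2.23 Suppose that A-STEAL schedules a job with work T_1 and span T_∞ on a machine
with P processors. Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter,
and L the quantum length. Then, A-STEAL completes the job in expected time E[T] = O(T_1/P̃ + T_∞ + L lg P),
where P̃ is the O(T_∞ + L lg P)-trimmed availability.
PROOF. With probability at least 1 - ε, A-STEAL completes the job in time
\[ T \le \frac{T_1}{\delta\widetilde{P}}\left(1 + \frac{1+\rho}{L\delta - 1 - \rho}\right) + O\left(\frac{T_\infty}{1-\delta} + L\log_\rho P + L\ln(1/\varepsilon)\right) \tag{2.3} \]
for some constant C. In order to get the expectation, we assume that the maximum completion
time is T_1, as when the job runs entirely sequentially.
\[
\begin{aligned}
E[T] &\le (1-\varepsilon)\,T + \varepsilon T_1 \\
     &\le (1-\varepsilon)\left(\frac{T_1}{\delta\widetilde{P}}\left(1 + \frac{1+\rho}{L\delta - 1 - \rho}\right) + O\left(\frac{T_\infty}{1-\delta} + L\log_\rho P + L\ln(1/\varepsilon)\right)\right) + \varepsilon T_1 .
\end{aligned}
\]
Substitute ε = 1/P and assume that δ and ρ are constants.
\[
\begin{aligned}
E[T] &\le (1 - 1/P)\cdot O\left(\frac{T_1}{\widetilde{P}} + T_\infty + L\ln P + L\ln P\right) + \frac{T_1}{P} \\
     &= O\left(\frac{T_1}{\widetilde{P}} + T_\infty + L\ln P\right),
\end{aligned}
\]
where the last step uses P̃ ≤ P. □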
The analysis leading to Theorem 2.22 and its corollary makes two assumptions. First, we
assume that the scheduler knows exactly how many steal-cycles have occurred in the quantum.
Second, we assume that the processors can find the muggable deques instantaneously. We now
relax these assumptions and show that they do not adversely affect the asymptotic running time of
A-STEAL.
A scheduling system can implement the counting of steal-cycles in several ways that impact
our theoretical bounds only minimally. For example, if the number of processors in the machine
P is smaller than the quantum length L, then the system can designate one processor to collect all
the information from the other processors at the end of each quantum. Collecting this information
increases the time bound by a multiplicative factor of only 1 + P/L. As a practical matter, one
would expect that P < L, since scheduling quanta tend to be measured in tens of milliseconds
and processor cycle times in nanoseconds or less, and thus the slowdown would be negligible.
Alternatively, one might organize the processors for the job into a tree structure so that it takes
O(lg P) time to collect the total number of steal-cycles at the end of each quantum. The tree
implementation introduces a multiplicative factor of 1 + (lg P)/L to the job's execution time, an
even less significant overhead.
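The tree-structured collection can be sketched as a pairwise reduction. The code below is an illustrative model, not the scheduler's implementation: it simulates the rounds sequentially, and the function name and the list-of-counts input are assumptions. Each round halves the number of partial sums, so P per-processor counts are combined in ⌈lg P⌉ rounds.

```python
import math

def tree_collect(counts):
    """Combine per-processor steal-cycle counts pairwise, as a binary tree
    would, and return (total, rounds). Assumes a non-empty list of counts."""
    values = list(counts)
    rounds = 0
    while len(values) > 1:
        # Neighbouring partial sums are merged; all merges in one round
        # could proceed in parallel on the real machine.
        merged = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            merged.append(values[-1])   # odd element passes through unchanged
        values = merged
        rounds += 1
    return values[0], rounds

total, rounds = tree_collect([3, 0, 7, 1, 4, 4, 2, 9, 5, 0, 1, 6, 8])  # P = 13
print(total, rounds, math.ceil(math.log2(13)))
```

With L in the hundreds or more and P in the tens or hundreds, the number of rounds is small compared with the quantum length, which is the source of the 1 + (lg P)/L factor.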
The second assumption, that it takes constant time to find a muggable deque, can be relaxed
in a similar manner. One option is to mug serially, that is, while there is a muggable deque, all
processors try to mug according to a fixed linear order. This strategy could increase the num-
ber of mug-cycles by a factor of P in the worst case. If P < L, however, this change again
does not affect the running time bound by much. Alternatively, to obtain a better theoretical
bound, we could use a counting network [AHS94] with width P to implement the list of
muggable deques, in which case each mugging operation would consume O(lg^2 P) processor cycles.
The number of accounted steps in the time bound from Lemma 2.21 would increase slightly
to (T_1/(δP̃))(1 + (1 + ρ) lg^2 P/(Lδ - 1 - ρ)), but the number of deductible steps would not
change.
Waste analysis
The next theorem bounds the waste, which is the total number of mug- and steal-cycles.
Theorem 2.24 Suppose that A-STEAL schedules a job with work T_1 on a machine with P processors.
Let ρ denote A-STEAL's responsiveness parameter, δ the utilization parameter, and L the
quantum length. Then, A-STEAL wastes at most
\[ W \le \left(\frac{1+\rho-\delta}{\delta} + \frac{(1+\rho)^2}{\delta(L\delta - 1 - \rho)}\right) T_1 \tag{2.5} \]
processor cycles in the course of computation.
PROOF. Let M be the total number of mug-cycles, and let S be the total number of steal-cycles.
Hence, we have W = S + M. Since Lemma 2.20 bounds M, we only need to bound S, which we
do using an accounting argument based on whether a quantum is inefficient or efficient. Let S_ineff
and S_eff, where S = S_ineff + S_eff, be the numbers of steal-cycles on inefficient and efficient quanta,
respectively.
Inefficient quanta. Lemma 2.19 shows that every inefficient quantum q with desire d_q has a
distinct corresponding efficient-and-satisfied quantum r = f(q) with desire d_r = d_q/ρ. Thus,
the steal-cycles on quantum q can be amortized against the nonsteal-cycles on quantum r.
Since quantum r is efficient-and-satisfied, its nonsteal usage satisfies n_r ≥ Lδa_r and its
allocation is a_r = d_r. Therefore, we have n_r ≥ Lδa_r = Lδd_r = Lδd_q/ρ ≥ Lδa_q/ρ. Let
s_q be the number of steal-cycles in quantum q. Since there are La_q processor cycles in the
quantum, we have s_q ≤ La_q ≤ ρn_r/δ, that is, the number of steal-cycles in the inefficient
quantum q is at most a ρ/δ fraction of the nonsteal-cycles in its corresponding efficient-
and-satisfied quantum r. Therefore, the total number of steal-cycles in all inefficient quanta
satisfies S_ineff ≤ (ρ/δ)(T_1 + M).
Efficient quanta. On any efficient quantum q, the job has at least Lδa_q work- and mug-cycles and
at most L(1 - δ)a_q steal-cycles. Summing over all efficient quanta, the number of steal-cycles
on efficient quanta is S_eff ≤ ((1 - δ)/δ)(T_1 + M).
The total waste is therefore W = S + M = S_ineff + S_eff + M ≤ (T_1 + M)(1 + ρ - δ)/δ + M.
Since Lemma 2.20 provides M ≤ T_1(1 + ρ)/(Lδ - 1 - ρ), the theorem follows. □
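The last step of the proof combines two bounds, and the combination can be checked numerically. The sketch below is illustrative: the function names are assumptions, and the parameter values δ = 0.8, ρ = 1.5, L = 200 are simply sample settings. It evaluates the coefficient of T_1 once by substituting the mug-cycle bound of Lemma 2.20 into W ≤ (T_1 + M)(1 + ρ - δ)/δ + M and once from the closed form in Inequality (2.5).

```python
def mug_cycle_factor(L, delta, rho):
    """Coefficient of T_1 in the mug-cycle bound of Lemma 2.20."""
    if L * delta <= 1 + rho:
        raise ValueError("the bound requires L*delta > 1 + rho")
    return (1 + rho) / (L * delta - 1 - rho)

def waste_factor_stepwise(L, delta, rho):
    """Coefficient of T_1 from W <= (T_1 + M)(1 + rho - delta)/delta + M."""
    m = mug_cycle_factor(L, delta, rho)
    return (1 + m) * (1 + rho - delta) / delta + m

def waste_factor_closed_form(L, delta, rho):
    """Coefficient of T_1 as written in Inequality (2.5)."""
    return ((1 + rho - delta) / delta
            + (1 + rho) ** 2 / (delta * (L * delta - 1 - rho)))

print(waste_factor_stepwise(200, 0.8, 1.5))
print(waste_factor_closed_form(200, 0.8, 1.5))
```

Both expressions evaluate to roughly 2.17 for these settings, of which only about 0.05 comes from the terms involving L.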
Interpretation of the bounds
If the utilization parameter δ and responsiveness parameter ρ are constants, the bounds in
Equation (2.2) and Inequality (2.5) can be simplified somewhat as follows:
\[
\begin{aligned}
T &\le \frac{T_1}{\delta\widetilde{P}}\,\bigl(1 + O(1/L)\bigr) + O\left(\frac{T_\infty}{1-\delta} + L\log_\rho P + L\ln(1/\varepsilon)\right), \\
W &\le \left(\frac{1+\rho-\delta}{\delta} + O(1/L)\right) T_1 .
\end{aligned}
\tag{2.6}
\]
This reformulation allows us to more easily see the trade-offs due to the setting of the δ and ρ
parameters.
In the time bound, as δ increases toward 1, the coefficient of T_1/P̃ decreases toward 1, and
the job comes closer to perfect linear speedup on accounted steps. The number of deductible steps
increases at the same time, however. Moreover, as δ increases and ρ decreases, the completion time
increases and the waste decreases. The utilization parameter δ may lie between 80% and 95%, and
the responsiveness parameter ρ can be set between 1.2 and 2.0. The quantum length L is a system
configuration parameter, which might have values in the range 10^3 to 10^5.
To see how these settings affect the waste bound, consider the waste bound as comprising two
parts, where the waste due to steal-cycles is S ≤ ((1 + ρ - δ)/δ)T_1 and the waste due to mug-cycles
is M = O(1/L)T_1. We can see that the waste due to mug-cycles is just a tiny fraction compared to
the work T_1. Thus, these bounds indicate that adaptive scheduling with parallelism feedback can
be achieved without imposing much overhead when adding processors to or removing them from jobs.
The major part of waste comes from steal-cycles, where S is generally less than 2T_1 for typical
parameter values. The analysis of Theorem 2.24 shows, however, that the number of steal-cycles in
efficient steps is bounded by ((1 - δ)/δ)T_1, which is a small fraction of S. Thus, most of the waste
comes from the steal-cycles in inefficient quanta. Our analysis assumes that the job scheduler is an
adversary, creating as many inefficient quanta as possible. Of course, job schedulers are generally
not adversarial. Thus, in practice, we expect the waste to be a much smaller fraction of T_1 than our
bounds. Chapter 3 describes experiments that confirm this intuition.
2.8 Related Work
This section discusses related work on adaptive scheduling of parallel jobs. There has been a large
amount of work on nonadaptive thread scheduling, both using greedy scheduling and work stealing.
Work in the area of adaptive scheduling has generally centered on job schedulers, some of which
use "dynamic equipartitioning" as a strategy for allotting processors to jobs.
Greedy scheduling is an old scheduling strategy, first introduced by Graham [Gra69], and
later independently invented by Brent [Bre74]. Some version of it has since been implemented
in many data parallel languages such as Fortran (HPF) [For93], NESL [BCH+94, BG96], and
ZPL [CCL+00]. Of particular interest is the work by Blelloch and his coauthors [BG96, NB99,
BGM99] which provides various nonadaptive task schedulers for a generalized class of data-
parallel jobs, called nested data-parallel jobs. Specifically, their "prioritized" task schedulers are
provably efficient with respect to both time and space. A-GREEDY can be combined with priori-
tized task schedulers to produce adaptive task schedulers that are provably efficient with respect to
time, space, and waste.
Work-stealing has been used as a heuristic since Burton and Sleep's research [BS81] and Hal-
stead's implementation of Multilisp [Hal84]. Many variants have been implemented since then
[MKHJ90, HZJ94, FM87], and it has been analyzed in the context of load balancing [RSAU91],
backtrack search [KZ88], etc. Blumofe and Leiserson [BL99] proved that the work-stealing al-
gorithm is efficient with respect to time, space, and communication for the class of "fully strict"
multithreaded computations. Arora, Blumofe and Plaxton [ABP98] extended the time bound result
to arbitrary multithreaded computations. Various researchers [HLMS06, HS02, CL05] have since
simplified and improved the memory allocation for deques for ABP. In addition, Acar, Blelloch,
and Blumofe [ABB00] showed that work-stealing schedulers are efficient with respect to cache
misses for jobs with "nested parallelism." Variants of work-stealing algorithms have been
implemented in many systems [BJK+95, FLR98, BP99], and empirical studies show that work-stealing
schedulers are scalable and practical [FLR98, BP98b].
Adaptive thread scheduling without parallelism feedback has been studied in the context of
multithreading, primarily by Blumofe and his coauthors [BL97, BP94, ABP98]. In this work, the
thread scheduler schedules threads using a "work-stealing" [MKHJ90, BL99] strategy, but it does
not provide the feedback about the job's parallelism to the job scheduler. The work in [BL97, BP94]
addresses networks of workstations where processors may fail or join and leave a computation
while the job is running, showing that work-stealing provides a good foundation for adaptive thread
scheduling. In theoretical work, Arora, Blumofe, and Plaxton [ABP98] exhibit a work-stealing
thread scheduler that provably completes a job in O(T_1/P̄ + PT_∞/P̄) expected time, where P̄ is
the average number of processors allotted to the job by the job scheduler. Although they provide
no bounds on waste, one can prove that their algorithm may waste Ω(PT_∞) processor cycles in an
adversarial setting.
Adaptive job schedulers have been studied empirically [MVZ93, YL01, TG89, LV90, MEB88,
NSS93] and theoretically [Gu95, DD96, MPT93, Edm99, ECBD03, BDKS04]. McCann, Vaswani,
and Zahorjan [MVZ93] studied many different job schedulers and evaluated them on a set of bench-
marks. They also introduced the notion of dynamic equipartitioning, which gives each job a fair
allotment of processors, while allowing processors that cannot be used by a job to be reallocated
to other jobs. Their studies indicate that dynamic equipartitioning may be an effective strategy for
adaptive job scheduling. Gu [Gu95] proved that dynamic equipartitioning with instantaneous par-
allelism feedback is 4-competitive with respect to makespan for batched jobs with multiple phases,
where the parallelism of the job remains constant during the phase and the phases are relatively
long compared with the length of a scheduling quantum. Deng and Dymond [DD96] proved a
similar result for mean response time for multiphase jobs regardless of their arrival times. Song
[Son98] proves that a randomized distributed strategy can implement dynamic equipartitioning.
Subsequent to the work described in this chapter, my collaborators extended this work [HHL06,
HHL07], analyzing the performance of A-GREEDY or A-STEAL when combined with dynamic
equipartitioning and round-robin job schedulers. They find that a combination of A-GREEDY and
A-STEAL with these "nice" job schedulers yields a makespan that is constant-competitive with the
optimal. In addition, if all the jobs arrive at the same time, then these combinations also lead to
constant-competitive mean completion time.

Chapter 3
Experimental Evaluation of Adaptive Work
Stealing with Parallelism Feedback
In the previous chapter, we described A-STEAL, a work stealing algorithm for adaptive schedul-
ing with parallelism feedback. In this chapter, we shall evaluate A-STEAL experimentally. Our
studies monitored the behavior of A-STEAL on a simulated multiprocessor system using synthetic
workloads. Our experiments indicate that A-STEAL provides good performance on moderately to
heavily loaded large machines. In addition, we compared A-STEAL with an adaptive scheduler that
does not provide parallelism feedback [ABP98], and A-STEAL compares favorably in performance.
This chapter is organized as follows: Section 3.1 provides the summary of our experiments,
Section 3.2 explains the experimental setup, and Sections 3.3, 3.4, 3.5, and 3.6 provide details
about four sets of experiments.
3.1 Summary of Experiments
To evaluate the performance of A-STEAL empirically, we built a discrete-time simulator using
DESMO-J [DES99]. Some of our experiments benchmarked A-STEAL against ABP [ABP98],
an adaptive thread scheduler by Arora et al. that does not supply parallelism feedback to the job
scheduler. This section describes our simulation setup and the results of the experiments.
We conducted four sets of experiments on the simulator with synthetic jobs. Our results are
summarized below:
* The time experiments investigated the performance of A-STEAL on over 2300 job runs. A
linear-regression analysis of the results provides evidence that the coefficients on the number
of accounted and deductible steps are considerably smaller than the upper bounds provided
by our theoretical bounds. A second linear-regression analysis indicates that A-STEAL com-
pletes jobs on average for at most twice the optimal number of time steps, which is the same
bound provided by offline greedy scheduling [Gra69, Bre74].
* The waste experiments are designed to measure the waste incurred by A-STEAL in practice
and compare the observed waste to the theoretical upper bounds. Our experiments indicate
that the waste is almost insensitive to the parameter settings and is a tiny fraction (less than
10%) of the work for jobs with high parallelism.
* The time-waste experiments compare the completion time and waste of A-STEAL with ABP
[ABP98] by running single jobs with predetermined availability profiles. These experiments
indicate that on large machines, when the mean availability P̄ is considerably smaller than
the number P of processors in the machine, A-STEAL completes jobs faster than ABP while
wasting fewer processor cycles than ABP. On medium-sized machines, when P̄ is of the
same order as P, ABP completes jobs slightly faster than A-STEAL, but it still wastes many
more processor cycles than A-STEAL.
* The utilization experiments compare the utilization of A-STEAL and ABP when many jobs
with varying characteristics are using the same multiprocessor resource. The experiments
provide evidence that on moderately to heavily loaded large machines, A-STEAL consistently
provides a higher utilization than ABP for a variety of job mixes.
This is joint work with Yuxiong He and Charles E. Leiserson [AHL06].
3.2 Simulation Setup
Our Java-based discrete-time simulator, which was implemented using DESMO-J [DES99], imple-
ments four major entities - processors, jobs, thread schedulers, and job schedulers. The simulator
tracks their interactions in a two-level scheduling environment. We modeled jobs as dags, which
are executed by the thread scheduler. When a job is submitted to the simulated multiprocessor
system, an instance of a thread scheduler is created for the job. The job scheduler allots processors
to the job, and the thread scheduler simulates the execution of the job using work-stealing. The
simulator operates in discrete time steps: a processor can complete either a work-cycle, steal-cycle,
or mug-cycle during each time step. We ignored the overheads due to the reallocation of processors
in the simulation.
We tested synthetic multithreaded jobs, with the parallelism profile shown in Figure 3.1. Each
job alternates between a serial phase of length w_1 and a parallel phase (with h-way parallelism) of
length w_2. The average parallelism of the job is approximately (w_1 + hw_2)/(w_1 + w_2). By varying
the values of w_1, w_2, h, and the number of iterations, we can generate jobs with different work,
span, and frequency of the change of the parallelism.
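The work, span, and average parallelism of such a synthetic job follow directly from its four parameters. The helper below is a minimal sketch: the function name is an assumption, and it idealizes each phase as contributing exactly its length to the span and its length times its width to the work, which is why the text describes the parallelism as approximate.

```python
def job_characteristics(w1, w2, h, iterations):
    """Return (work, span, average parallelism) of a job that repeats a serial
    phase of length w1 followed by an h-way parallel phase of length w2."""
    work = iterations * (w1 + h * w2)   # every node of the dag counted once
    span = iterations * (w1 + w2)       # one strand through every phase
    return work, span, work / span

# Hypothetical job: 2 iterations, serial length 100, 9-way parallel length 100.
print(job_characteristics(100, 100, 9, 2))
```

Increasing h raises the parallelism without changing the span, while increasing the number of iterations scales work and span together and makes the parallelism change more often during the run.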
In the time-waste experiments and the utilization experiments, we compared the performance
of A-STEAL with that of another thread scheduler, ABP [ABP98], an adaptive thread scheduler
that does not provide parallelism feedback to the job scheduler. In these experiments, ABP is
always allotted all the processors available to the job. ABP uses a nonblocking implementation of
work-stealing and always maintains P deques. When the job scheduler allots a_q = p_q processors
in quantum q, ABP selects a_q deques uniformly at random from the P deques, and the allotted
processors start working on them. Arora, Blumofe, and Plaxton [ABP98] prove that ABP completes
a job in expected time
\[ T = O(T_1/\overline{P} + P\,T_\infty/\overline{P}), \tag{3.1} \]
where P̄ is the average number of processors allotted to the job by the job scheduler.
Although Arora et al. provide no bounds on waste, one can prove that ABP may waste Ω(PT_∞)
processor cycles in an adversarial setting. For a job that is completely sequential, we have T_1 =
T_∞ < PT_∞. A job scheduler may allot all P processors to this job. Therefore, the total
number of processor cycles allotted to the job is PT_∞, since the job completes in T_∞ time. The job
uses only T_1 = T_∞ processor cycles. Therefore, ABP wastes Ω(PT_∞) processor cycles.
Figure 3.1: The parallelism profile (for 2 iterations) of the jobs used in the simulation. (Plot omitted; the horizontal axis is time.)
We implemented three kinds of job schedulers: profile-based, equipartitioning [MVZ93], and
dynamic equipartitioning [MVZ93]. A profile-based job scheduler was used in the first three sets of
experiments, and both equipartitioning and dynamic equipartitioning job schedulers were used in
the utilization experiments. An equipartitioning (EQ) job scheduler simply allots the same number
of processors to all the active jobs in the system. Since ABP provides no parallelism feedback,
EQ is a suitable job scheduler for ABP's scheduling model. Dynamic equipartitioning (DEQ) is
a dynamic version of the equipartitioning policy, but it requires parallelism feedback. A DEQ job
scheduler maintains an equal allotment of processors to all jobs with the constraint that no job is
allotted more processors than it requests. DEQ is compatible with A-STEAL's scheduling model,
since it can use the feedback provided by A-STEAL to decide the allotment.
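The difference between the two policies can be sketched as follows. This is an illustrative reading of the descriptions above, not the simulator's code: the function names, the integer rounding, and the rule that leftover processors go to the earliest jobs in the list are all assumptions. EQ ignores desires; DEQ gives every job whose desire is at most the fair share exactly what it asks for and splits the remainder equally among the others.

```python
def equipartition(total, num_jobs):
    """EQ: split `total` processors as evenly as possible among the jobs."""
    share, extra = divmod(total, num_jobs)
    return [share + (1 if i < extra else 0) for i in range(num_jobs)]

def dynamic_equipartition(total, desires):
    """DEQ: equal shares, except that no job receives more than it requests."""
    allot = [0] * len(desires)
    active = list(range(len(desires)))
    remaining = total
    while active:
        share = remaining // len(active)
        satisfied = [j for j in active if desires[j] <= share]
        if not satisfied:
            # Every remaining job wants more than the fair share: split evenly.
            extra = remaining - share * len(active)
            for i, j in enumerate(active):
                allot[j] = share + (1 if i < extra else 0)
            break
        for j in satisfied:
            allot[j] = desires[j]        # small requests are met in full
            remaining -= desires[j]
        active = [j for j in active if j not in satisfied]
    return allot

print(equipartition(10, 3))                  # [4, 3, 3]
print(dynamic_equipartition(10, [1, 8, 8]))  # [1, 5, 4]
```

In the example, EQ hands four processors to a job that can use only one, whereas DEQ returns the surplus to the two jobs that asked for more, which is the reallocation that parallelism feedback makes possible.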
For the first three experiments - time, waste, and time-waste - we ran a single job with a
predetermined availability profile: the sequence of processor availabilities p_q for all the quanta
while the job is executing. For the profile-based job scheduler, we precomputed the availability
profile, and during the simulation, the job scheduler simply used the precomputed availability for
each quantum. We generated three kinds of profiles:
* Uniform profiles: The processor availabilities in these profiles follow the uniform distri-
bution in the range from 1 to the maximum number P of processors in the system. These
profiles represent near-adversarial conditions for A-STEAL, because the availability for one
quantum is unrelated to the availability for the previous quantum.
* Smooth profiles: In these profiles, the change of processor availabilities from one scheduling
quantum to the next follows a standard normal distribution. Thus, the processor availability is
unlikely to change significantly over two consecutive quanta. These profiles attempt to model
situations where new arrivals of jobs are rare, and the availability changes significantly only
when a new job arrives.
* Practical profiles: These availability profiles were generated from the workload archives [Fei]
of various computer clusters. We computed the availability at every quantum by subtracting
the number of processors that were being used at the start of the quantum from the number
of processors in the machine. These profiles are meant to capture the processor availability
in practical systems.
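The three kinds of profiles can be generated along the following lines. This is a sketch under stated assumptions rather than the generator used in the experiments: the function names, the fixed seeds, the rounding of the normally distributed change to an integer, the clamping to the range 1 to P, the random starting level of the smooth profile, and the `used` trace passed to the practical profile are all choices made for illustration.

```python
import random

def uniform_profile(P, quanta, seed=0):
    """Each quantum's availability is drawn independently from 1..P."""
    rng = random.Random(seed)
    return [rng.randint(1, P) for _ in range(quanta)]

def smooth_profile(P, quanta, seed=0):
    """Availability changes between quanta by a standard normal step
    (rounded, and clamped to 1..P as an assumption)."""
    rng = random.Random(seed)
    level = rng.randint(1, P)
    profile = []
    for _ in range(quanta):
        profile.append(level)
        step = round(rng.gauss(0.0, 1.0))
        level = min(P, max(1, level + step))
    return profile

def practical_profile(P, used):
    """Availability is the machine size minus the processors in use at the
    start of each quantum; `used` stands in for a workload trace."""
    return [P - u for u in used]

print(uniform_profile(128, 5))
print(smooth_profile(128, 5))
print(practical_profile(128, [100, 96, 120, 64]))  # [28, 32, 8, 64]
```

Consecutive values of the uniform profile are unrelated, which is what makes it close to adversarial for a scheduler that bases its next request on the previous quantum.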
A-STEAL requires certain parameters as input. The responsiveness parameter is ρ = 1.5 for
all the experiments. For all experiments except the waste experiments, the utilization parameter
is δ = 0.8. We varied δ in the waste experiments. The quantum length L represents the time
between successive reallocations of processors by the job scheduler and is selected to amortize the
overheads due to communication between the job scheduler and the thread scheduler and to the
reallocation of processors. In conventional computer systems, a scheduling quantum is typically
between 10 and 20 milliseconds. Our experience with the Cilk runtime system [Sup03] indicates
that a steal/mug-cycle takes approximately 0.5 to 5 microseconds, suggesting that the quantum
length L should be set to values between 10^3 and 10^5 time steps. Our theoretical bounds indicate
that as long as T_∞ > L log P, the length of L should have little effect on our results. Due to
the performance limitations of our simulation environment, however, we were unable to run very
long jobs: most have span on the order of only a few thousand time steps. Therefore, to satisfy the
condition that T_∞ > L log P, we set L = 200.
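The suggested range for L on a real system follows from dividing the quantum duration by the duration of a steal- or mug-cycle. The short calculation below is a sketch: the function name is an assumption, and durations are expressed in integer nanoseconds to avoid floating-point rounding.

```python
def quantum_length_range(quantum_ns=(10_000_000, 20_000_000),
                         cycle_ns=(500, 5_000)):
    """Smallest and largest number of steal/mug-cycles that fit in a quantum
    of 10 to 20 ms when one cycle takes 0.5 to 5 microseconds."""
    shortest = quantum_ns[0] // cycle_ns[1]   # short quantum, slow cycles
    longest = quantum_ns[1] // cycle_ns[0]    # long quantum, fast cycles
    return shortest, longest

print(quantum_length_range())  # (2000, 40000)
```

A real system would therefore use a quantum of thousands to tens of thousands of time steps, which is one to two orders of magnitude longer than the L = 200 that the simulated job sizes permit.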
3.3 Time Experiments
The running-time bounds proved in Chapter 2, Section 2.7, though asymptotically strong, have
weak constants. The time experiments were designed to investigate what constants would occur
in practice and how A-STEAL performs compared to an optimal scheduler. We performed linear-
regression analysis on the results of 2331 job runs using many availability profiles as described earlier
to answer these questions.
Our first time experiment uses the bounds in Equation (2.2) as a simple model, as in the study
[BJK+96]. Assuming that equality holds and disregarding smaller terms, the model estimates
performance as
\[ T \approx c_1 T_1/\widetilde{P} + c_\infty T_\infty, \tag{3.2} \]
where c_1 > 0 is the work overhead and c_∞ > 0 is the span overhead. When δ = 0.8, ρ = 1.5, and
L = 200, the coefficients for the asymptotic bounds in Equation (2.2) turn out to be 1.26 < c_1 <
1.27 and c_∞ = 480, but a direct analysis of expectation can improve the bound on span overhead to
c_∞ = 60. Since the span overhead c_∞ is large, the bound indicates that A-STEAL may not provide
linear speedup except when T_1/T_∞ > 60P̃. Moreover, on accounted time steps, A-STEAL might
not provide perfect linear speedup, since the work overhead is 1.26 > 1.
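The two theoretical coefficients quoted here can be reproduced from the bounds of Chapter 2. In the sketch below the function names are assumptions. The work overhead is the coefficient of T_1/P̃ in Equation (2.2), and the span overhead keeps only the leading T_∞ terms of Lemmas 2.15 and 2.18 (48T_∞/((1 - δ)L) quanta each, every quantum lasting L steps), ignoring the logarithmic terms.

```python
def work_overhead_bound(L, delta, rho):
    """Coefficient of T_1 / P_trimmed in the time bound of Theorem 2.22."""
    return (1 / delta) * (1 + (1 + rho) / (L * delta - 1 - rho))

def span_overhead_bound(delta):
    """Coefficient of T_inf from the inefficient and the efficient-and-
    satisfied quanta: 2 * 48 / ((1 - delta) * L) quanta of L steps each."""
    return 2 * 48 / (1 - delta)

print(work_overhead_bound(200, 0.8, 1.5))   # about 1.2698
print(span_overhead_bound(0.8))             # about 480
```

The work overhead is dominated by the factor 1/δ = 1.25; the dependence on L and ρ contributes less than 0.02 at these settings.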
In practice, however, we should not expect these large overheads to materialize. First, our
analysis is focused on asymptotic bounds and uses bounding techniques such as Markov's inequality and
Chernoff bounds, which are not necessarily tight. Second, our analysis assumes that the job
completes only the minimum number of work-cycles in each quantum, specifically, 0 on a deductible
quantum and δLa_q on an accounted quantum with allotment a_q.
Figure 3.2: Comparing the (true) mean availability P̄ with the trimmed availability P̃ using three availability
profiles. Each data point represents a job execution for which the mean availability and trimmed availability
were measured. These values were normalized by dividing by the parallelism T_1/T_∞ of the job. When the
parallelism satisfies T_1/T_∞ > 5P̄, the experiments indicate that for all profiles, the trimmed availability is a
good approximation of the mean availability. All these experiments used δ = 0.8 and ρ = 1.5.
(Plot omitted. Legend: uniform profile, smooth profile, practical profile; horizontal axis: P/Parallelism, with ticks from 0.01 to 10.)
Our first linear-regression analysis fits the running time of the 2331 job runs to Equation (3.2).
The trimmed mean $\tilde{P}$ of a job run is computed as the average processor availability of all accounted
steps during the execution of the job. The least-squares fit to the data to minimize relative error
yields $c_1 = 0.960 \pm 0.003$ and $c_\infty = 0.812 \pm 0.009$ with 95% confidence. The $R^2$ correlation
coefficient of the fit is 99.4%. Since $c_\infty = 0.812 \pm 0.009$, on average the jobs achieved linear
speedup when $T_1/T_\infty > \tilde{P}$. In addition, since we have $c_1 = 0.960 \pm 0.003$, A-STEAL achieves
almost perfect linear speedup on the accounted steps. The value $c_1 < 1$ arises because
jobs performed some work during the deductible steps.
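As an illustration of how such a fit can be computed, the following is a minimal sketch of a two-parameter least-squares fit that minimizes relative error. It is not the analysis code used for the experiments; the names JobRun, Fit, and fit_relative_error are hypothetical. Dividing each residual by the measured time T turns the problem into an ordinary least-squares fit of the constant 1, which has a closed-form 2x2 solution.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

// One job run: the two model terms and the measured running time.
// (Hypothetical record layout, chosen for this sketch.)
struct JobRun {
    double work_per_avail;  // T_1 divided by the trimmed availability
    double span;            // T_infinity
    double time;            // measured running time T
};

struct Fit {
    double c1;    // fitted work overhead
    double cinf;  // fitted span overhead
};

// Minimizes sum_i ((c1*x_i + cinf*y_i - T_i) / T_i)^2 by solving the
// 2x2 normal equations of the rescaled problem c1*a_i + cinf*b_i = 1,
// where a_i = x_i / T_i and b_i = y_i / T_i.
Fit fit_relative_error(const std::vector<JobRun>& runs) {
    double saa = 0, sab = 0, sbb = 0, sa = 0, sb = 0;
    for (const JobRun& r : runs) {
        const double a = r.work_per_avail / r.time;
        const double b = r.span / r.time;
        saa += a * a;
        sab += a * b;
        sbb += b * b;
        sa += a;
        sb += b;
    }
    const double det = saa * sbb - sab * sab;
    if (std::fabs(det) < 1e-12) {
        throw std::invalid_argument("runs do not determine both coefficients");
    }
    return Fit{(sa * sbb - sb * sab) / det, (sb * saa - sa * sab) / det};
}
```

On data generated exactly from the model, the sketch recovers the generating coefficients; confidence intervals such as those reported above would require an additional residual analysis that is not shown.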
We performed a second set of regression tests on the same set of jobs to compare the performance
of A-STEAL with an optimal scheduler. We fit the job data to the curve
$T \approx \eta_1 T_1/\bar{P} + \eta_\infty T_\infty$ . (3.3)
The analysis yields $\eta_1 = 0.992 \pm 0.003$ and $\eta_\infty = 0.911 \pm 0.008$ with an $R^2$ correlation
coefficient of 99.4%. Both $T_1/\bar{P}$ and $T_\infty$ are lower bounds on the job's running time, and thus an
optimal scheduler requires at least
$\max\{T_1/\bar{P}, T_\infty\} \ge (T_1/\bar{P} + T_\infty)/2 \ge (\eta_1 T_1/\bar{P} + \eta_\infty T_\infty)/2$
time steps, since $\eta_1 < 1$ and $\eta_\infty < 1$. Consequently, on average A-STEAL completed the jobs
within at most twice the time of an optimal scheduler.
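The factor of two can be spelled out as a short chain of inequalities. In the sketch below, $\eta_1$ and $\eta_\infty$ stand for the two fitted coefficients of Equation (3.3), and $T_{\mathrm{opt}}$ is a name introduced here (not used elsewhere in this chapter) for the running time of an optimal scheduler on the same job and availability profile.

```latex
\begin{align*}
T_{\mathrm{opt}} &\ge \max\{T_1/\bar{P},\; T_\infty\}
    && \text{(both terms are lower bounds)} \\
  &\ge \tfrac{1}{2}\left(T_1/\bar{P} + T_\infty\right)
    && \text{(a maximum is at least an average)} \\
  &\ge \tfrac{1}{2}\left(\eta_1 T_1/\bar{P} + \eta_\infty T_\infty\right)
    && \text{(since } \eta_1 < 1 \text{ and } \eta_\infty < 1\text{)} \\
  &\approx \tfrac{1}{2}\, T
    && \text{(by the fit to Equation (3.3))}
\end{align*}
```

Rearranging gives $T \lesssim 2\,T_{\mathrm{opt}}$, which holds on average over the fitted job runs rather than for every individual run.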
Equations (3.2) and (3.3) both predict performance with high accuracy, and yet $\tilde{P}$ and $\bar{P}$
can diverge significantly. To resolve this paradox, we compared $\tilde{P}$ and $\bar{P}$ on the job runs.
Figure 3.2 shows a graph of the results, where $\tilde{P}$ and $\bar{P}$ are each normalized by dividing by the
parallelism $T_1/T_\infty$ of the job. The diagonal line in the figure is the curve $\tilde{P} = \bar{P}$.
If a job has parallelism $T_1/T_\infty > 5\bar{P}$ (data points on the left), the experiment indicates that for
all three kinds of availability profiles, we have $\tilde{P} \approx \bar{P}$. In this case, we have
$T_1/\tilde{P} \approx T_1/\bar{P}$ and $T_1/\bar{P} > T_\infty$, which implies that the first terms in
Equations (3.2) and (3.3) are nearly identical and dominate the running time. On the other hand, if a job has
small parallelism (data points on the right), the values of $\tilde{P}$ and $\bar{P}$ diverge, and the
divergence depends on the availability profile used. In this region, however, the running time is dominated by
the span $T_\infty$, and thus the divergence of $\tilde{P}$ and $\bar{P}$ has little influence on the running time.
3.4 Waste Experiments
Our theoretical analysis shows that the waste exhibited by A-STEAL is at most $O(T_1)$. The constant
hidden in the O-notation depends on the parameter settings. In our first waste experiment, we
varied the value of the utilization parameter $\delta$ to determine the relationship between the waste and
the setting of $\delta$. For our second experiment, we investigated whether the waste incurred by a job
depends on the job's parallelism.
The proof of Theorem 2.6 shows that the number of processor cycles wasted by a job is
$((1 - \delta)/\delta)T_1$ on efficient quanta and approximately $(\rho/\delta)T_1$ on inefficient quanta.
Substituting $\delta = 0.8$ and $\rho = 1.5$, A-STEAL could waste as many as $0.25T_1$ processor cycles on
efficient quanta and as many as $1.875T_1$ processor cycles on inefficient quanta. Since this analysis
assumes that the job scheduler is an adversary and that the job completes the minimum number of work-cycles
in each quantum, we did not expect these constants to materialize in practice.
We measured the waste for 300 jobs, most of which had parallelism $T_1/T_\infty > 5\bar{P}$, for
$\delta = 0.5, 0.6, \ldots, 1.0$. The job runs used many availability profiles drawn equally from the three kinds.
Figure 3.3 shows the average waste normalized by the work $T_1$ of the job. For comparison, we
plotted the normalized theoretical bound from Inequality (2.6) for the total waste and the normalized
bound $((1 - \delta)/\delta)T_1$ for the waste on efficient quanta. As the figure shows (although the curve
is barely distinguishable from the x-axis), the observed waste is less than 10% of the work $T_1$ for
most values of $\delta$ and is considerably less than what the theoretical bounds predicted. Moreover, the
waste seems to be quite insensitive to the particular value of $\delta$.
We also ran an experiment to determine whether parallelism has an effect on waste. The bound
in Inequality (2.6) does not depend on the parallelism $T_1/T_\infty$ of the job, but only on the work $T_1$.
For the 2331 job runs used in the time experiments, we measured the waste versus parallelism.
Since waste is insensitive to $\delta$, all jobs used the value $\delta = 0.8$. Figure 3.4 graphs the results.
As can be seen in the figure, the higher the parallelism, the lower the waste-to-work ratio. The
reason is that when the parallelism is high, the job can usually use most of the available processors
without readjusting its desire. When the parallelism is low, however, the job's desire must track its
parallelism closely to avoid waste. This situation is where A-STEAL is most effective, as the job
pushes the theoretical waste bounds to their limit.
3.5 Time-Waste Experiments
The time-waste experiments were designed to compare A-STEAL with ABP, an adaptive thread
scheduler with no parallelism feedback. For our first experiment, we ran A-STEAL and ABP to
execute 756 job runs on a simulated machine with P = 512 processors. Each head-to-head run used
one of two practical availability profiles, one with $\bar{P} = 30$ and one with $\bar{P} = 60$. We measured the
time and waste of A-STEAL and ABP for each run. Our second experiment was similar, but it used
only P = 128 processors in the simulated machine over 330 job runs. Whenever the availability
exceeded 128, which was not often, we chopped the availability to 128.
Figure 3.5 shows the ratio of ABP to A-STEAL with respect to both time and waste as a function
of the mean availability $\bar{P}$, normalized by dividing by the parallelism $T_1/T_\infty$. This experiment
shows that A-STEAL completed jobs about twice as fast as ABP while wasting only about 10% of
the processor cycles wasted by ABP. Not surprisingly, A-STEAL wastes fewer processor cycles
than ABP, since A-STEAL uses parallelism feedback to limit possible excessive allotment. Para-
doxically, however, A-STEAL completes jobs faster than ABP, even though A-STEAL's allotment
in every quantum is at most that of ABP, which is always allotted all the available processors.
ABP's slow completion is due to how ABP manages its ready deques. In particular, ABP has
no mechanism for increasing and decreasing the number r of ready deques, and it maintains r = P
deques throughout the execution. Randomized work-stealing algorithms require $\Theta(r)$ steal-cycles
to reduce the length of the span by 1 in expectation. Consequently, if r is large, each steal-cycle
becomes less effective, and the job's progress along its span slows. Thus, if the job has small or
moderate parallelism (data points on the right), the span dominates the running time. If the job
has large parallelism (data points on the left), however, the impact is less. In contrast, A-STEAL
continues to make good progress along the span, regardless of parallelism, by reducing the number
of deques according to its allotment.
This paradox can also be understood by using the model from Equation (3.2) for A-STEAL and
an analogous model based on Equation (3.1) for ABP. Let us consider three cases:
* $T_1/T_\infty < \bar{P} < P$ (data points on the right): Whereas A-STEAL completes the job in
$\Theta(T_\infty)$ time, ABP requires $\Theta(P T_\infty/\bar{P})$ time.
* $\bar{P} < T_1/T_\infty < P$ (data points in the middle): A-STEAL provides linear speedup since
$T_1/T_\infty > \bar{P}$, but ABP does not, since $T_1/T_\infty < P$.
* $P < T_1/T_\infty$ (data points on the left): Both provide linear speedup in this range.
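The three cases can be summarized as a small classification routine. This is only a sketch of the case analysis: the names Regime and classify are hypothetical, the routine assumes the mean availability does not exceed the machine size, and the treatment of exact ties is an arbitrary choice, since the cases above use strict inequalities.

```cpp
// Which of the three cases a job falls into (hypothetical names).
enum class Regime {
    SpanBound,   // parallelism < mean availability < machine size
    OnlyASteal,  // mean availability < parallelism < machine size
    BothLinear   // machine size < parallelism
};

// parallelism is the ratio of work to span of the job.
// Assumes mean_availability <= machine_size; ties go to the later case.
Regime classify(double parallelism, double mean_availability, double machine_size) {
    if (parallelism < mean_availability) return Regime::SpanBound;
    if (parallelism < machine_size) return Regime::OnlyASteal;
    return Regime::BothLinear;
}
```

For a machine of 512 processors and a mean availability of 30, a job with parallelism 100 falls into the middle case, where only A-STEAL is expected to provide linear speedup.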
Since ABP performed relatively poorly when P is large compared to $\bar{P}$, our second experiment
investigated the case when P is closer to $\bar{P}$. Figure 3.6 shows the results on 330 job runs on a
simulated machine with P = 128. In this case, when jobs' parallelism is large compared to P,
both ABP and A-STEAL perform about the same with respect to both time and waste. As the
parallelism gets closer to P, ABP performs slightly better than A-STEAL with respect to time and
slightly worse with respect to waste. Since $P \approx \bar{P}$, the two models coincide, and ABP and
A-STEAL perform comparably. Therefore, on small machines, where the disparity between P and $\bar{P}$
cannot be very great, the advantage of parallelism feedback is diminished, and ABP may yet be an
effective thread-scheduling algorithm.
3.6 Utilization Experiments
The utilization experiments compared A-STEAL with ABP on a large server where many jobs are
running simultaneously and jobs arrive and leave dynamically. We implemented job schedulers
to allocate processors among various jobs: dynamic equipartitioning [MVZ93] for A-STEAL and
equipartitioning [TG89] for ABP. We simulated a 1000-processor machine for about $10^6$ time
steps, where jobs had a mean interarrival time of 1000 time steps. We compared the utilization
provided by A-STEAL and ABP over time.
It was unclear to us what distribution the parallelism and the span should follow. Although
many workload models for parallel jobs have been studied [Sev94, Fei96, Dow98, CB01, LF03],
none appears to apply directly to multithreaded jobs. Some studies [LO86, HBD97, HB99] claim
that the sizes of Unix jobs follow a heavy-tailed distribution. Lacking a well-recognized guideline,
we decided to try various distributions, and as it turned out, our results were fairly insensitive to
which we chose.
We considered 9 sets of jobs using three distributions on each of the parallelism and the span.
The means of the distributions were chosen so that jobs arrive faster than they complete and the
load on the machine progressively increases. Thus, we were able to measure the utilization of the
machine under various loads. The three distributions we explored were the following:
* Uniform distribution (U): The span is picked uniformly from the range 1,000 to 99,000.
The parallelism is generated uniformly in the range [1, 80].
* Heavy-tailed distribution 1 (HT1): We used a Zipf-like [Zip49] heavy-tailed distribution
where the probability of generating x is proportional to $1/x$. In our experiments, the
distribution for parallelism has mean value 36, and the distribution for span has mean value
50,000.
* Heavy-tailed distribution 2 (HT2): In this distribution, the probability of generating x is
proportional to $1/\sqrt{x}$. In our experiments, the distribution for parallelism has mean value 36,
and the distribution for span has mean value 50,000.
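To make the two heavy-tailed distributions concrete, the following sketch draws integers with probability proportional to a power of x. It is not the workload generator used in the experiments: the function names are hypothetical, and the support [lo, hi] is an assumption, since the text specifies only the shape and the mean of each distribution. An exponent of 1 corresponds to HT1 and an exponent of 0.5 to HT2.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Unnormalized weights w(x) = x^(-exponent) for x = lo, lo+1, ..., hi.
// Requires 1 <= lo <= hi. (Hypothetical helper for this sketch.)
std::vector<double> power_law_weights(int lo, int hi, double exponent) {
    std::vector<double> weights;
    for (int x = lo; x <= hi; ++x) {
        weights.push_back(std::pow(static_cast<double>(x), -exponent));
    }
    return weights;
}

// Draws one value from [lo, lo + weights.size() - 1] with probability
// proportional to the corresponding weight.
int sample_power_law(int lo, const std::vector<double>& weights, std::mt19937& rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return lo + dist(rng);
}
```

In such a generator, the upper end of the support would be tuned until the sample mean matches the target (36 for parallelism, 50,000 for span); that tuning step is omitted here.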
Of the 9 possible sets of jobs, we ran 6 experiments using parallelism and span drawn from
U/U, U/HT1, HT1/U, HT1/HT1, HT2/U, and HT2/HT2. For all these experiments, the comparison
between A-STEAL+DEQ and ABP+EQ followed the same qualitative trends. We broke time into
intervals of 2000 time steps and measured the utilization - the fraction of processor cycles spent
working - for each interval. Figure 3.7 shows the utilization as a function of time (log-scale) for
the U/U experiment at the top and for HT1/HT1 on the bottom. As can be seen in both figures,
ABP+EQ starts out with a higher utilization, since A-STEAL+DEQ initially requests just one pro-
cessor. Before 10% of the simulation has elapsed, however, A-STEAL+DEQ overtakes ABP+EQ
with respect to the utilization and then consistently provides a higher utilization. Although the
figure does not show it, the mean completion time of jobs under ABP+EQ is nearly 50% less than
those under A-STEAL+DEQ for both these distributions.
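The utilization measurement can be sketched as follows. The function name and the input representation (one count of working processors per time step) are assumptions made for this sketch, not a description of the simulator's internals.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// busy[t] is the number of processors doing useful work at time step t.
// Returns, for each interval of `interval` consecutive steps, the fraction
// of processor cycles spent working. A shorter final interval is normalized
// by its own length. (Hypothetical helper for this sketch.)
std::vector<double> utilization_by_interval(const std::vector<long>& busy,
                                            long processors,
                                            std::size_t interval) {
    std::vector<double> result;
    if (interval == 0 || processors <= 0) return result;
    for (std::size_t start = 0; start < busy.size(); start += interval) {
        const std::size_t end = std::min(start + interval, busy.size());
        long work = 0;
        for (std::size_t t = start; t < end; ++t) work += busy[t];
        const double capacity =
            static_cast<double>(processors) * static_cast<double>(end - start);
        result.push_back(static_cast<double>(work) / capacity);
    }
    return result;
}
```

In the experiments described above, the corresponding call would use 1000 processors and an interval of 2000 time steps.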
3.7 Related Work
In this section, we mention some of the related work not already touched upon in Chapter 2.
The paper most related to this work is one by Arora, Blumofe and Plaxton [ABP98], which presents the
ABP scheduler that we compared A-STEAL against. Adaptive task scheduling without parallelism feedback has also been
studied empirically in the context of data-parallel languages [EAS+95, EASS94]. This work fo-
cuses on compiler and runtime support for environments where the number of processors changes
while the program executes. Adaptive task scheduling with parallelism feedback has been studied
empirically in [TBB96, Son98, Sen04]. These researchers use a job's history of processor utiliza-
tion to provide feedback to dynamic-equipartitioning job schedulers. These studies use different
strategies for parallelism feedback, and all report better system performance with parallelism feed-
back than without, but it is not apparent which strategy is superior.
As we mentioned in Chapter 2, my collaborators extended this work [HHL06, HHL07], analyzing
the performance of A-GREEDY or A-STEAL when combined with dynamic-equipartitioning
and round-robin job schedulers. In addition to the theoretical results, they also performed simulation
studies using the same simulation environment described in this chapter. They find that a
concurrency platform that combines a thread scheduler using A-STEAL with a
job scheduler that uses dynamic equipartitioning performs very well on a large variety of job mixes.
[Figure 3.3 plot: horizontal axis from 0.5 to 1; curves: theoretical value of waste/$T_1$, theoretical value of waste/$T_1$ for efficient steps, experimental result of waste/$T_1$.]
Figure 3.3: Comparing the theoretical and practical waste (normalized by $T_1$) using A-STEAL for various
values of the utilization parameter $\delta$. The top line shows the total theoretical waste, the next line shows the
theoretical waste on efficient quanta, and the bottom line shows the observed waste. The observed waste
appears to be almost insensitive to the value of $\delta$ and is much smaller than the theoretical waste.
[Figure 3.4 plot: horizontal axis: Parallelism/P, from 0.1 to 1000; vertical axis from 0 to 0.9.]
Figure 3.4: How waste varies with parallelism. When $T_1/T_\infty > 10\bar{P}$, that is, the job's parallelism
significantly exceeds the average availability, the observed waste is only a tiny fraction of the work $T_1$. For jobs
with small parallelism, the waste showed a large variance but never exceeded the work $T_1$ in any of our runs.
The utilization parameter was $\delta = 0.8$ for all job runs.
[Figure 3.5 plots: two panels, each with horizontal axis P/Parallelism.]
Figure 3.5: Comparing the time and waste of A-STEAL against ABP when P = 512 and $\bar{P} = 30, 60$. In
this experiment, where P exceeds $\bar{P}$ by a significant margin, A-STEAL completes jobs about twice as fast
as ABP while wasting less than 10% of the processor cycles wasted by ABP.
[Figure 3.6 plots: two panels, Time Ratio and Waste Ratio, each with horizontal axis P/Parallelism.]
Figure 3.6: Comparing the time and waste of A-STEAL against ABP when P = 128 and $\bar{P} = 30, 60$. In
this experiment, where P and $\bar{P}$ are closer in magnitude, A-STEAL runs slightly slower than ABP, but it still
tends to waste fewer processor cycles than ABP.
[Figure 3.7 plots: two panels; horizontal axis: Time Steps, from 1e+003 to 1e+006; vertical axis from 0 to 1.]
Figure 3.7: Comparing the utilization over time of A-STEAL+DEQ and ABP+EQ. In the top figure, both
the span and the parallelism follow the uniform distribution, and in the bottom figure, they follow the HT1
distribution.

Chapter 4
Library for Dag Evaluations in Cilk++
Many programming problems can be formulated as computations on a directed acyclic graph,
where every node represents some computation, and a directed edge represents the constraint that
the predecessor's computation depends on the result of the successor's computation. We say that an
evaluation of a dag computation D, or a dag evaluation of D for short, is the result of performing
the computations for all nodes in the dag D. Often, in order to perform these dag evaluations
efficiently, programmers have to write their own task-pool schedulers. This chapter describes a
concurrency platform, called DAGEVAL, for evaluating such computations efficiently. This
concurrency platform is a Cilk++ library, and it allows programmers to exploit parallelism in a dag
computation without writing their own task-pool schedulers.
The outline of the chapter is as follows: Section 4.1 presents some motivation for solving the
problem, Section 4.2 describes the interface and the implementation of the dag evaluation library,
and Section 4.3 shows the theoretical analysis of performance of this library. Sections 4.4, 4.5,
and 4.6 describe the experimental setup, applications and the experiments for evaluating the per-
formance of the library.
4.1 Motivation and Results
Many parallel computations are best described in the form of a dag $D = (D_V, D_E)$. Each node
$B \in D_V$ in the dag represents some computation that depends on all the successors of the node in
the dag. The result of a node can be computed when the results of all its children are available. The
goal is to compute all the results. Note that this dag is slightly different from the execution dag we
described in Chapter 1. For one, the dependencies go in the opposite direction: In an execution dag,
each node depends on its predecessors, while in D, each node depends on its successors. Second,
in an execution dag, each node has exactly one unit of work, while in D, each node can contain
an arbitrary amount of work. The most important difference, however, is that the execution dag
is a model of a parallel computation, while a dag evaluation is a programming methodology. An
execution dag is an a posteriori model of a parallel program, and the programmer may know nothing
about it. On the other hand, a dag evaluation is explicitly programmed as a dag by the programmer.
Writing sequential code for dag computations is straightforward; as shown in Figure 4.1, one
can traverse the dag nodes in a depth-first manner, computing each node after all its descendants
have been computed. Dag evaluations often contain inherent parallelism, however. For example,
two nodes that are not connected by any directed path in the dag can be computed in parallel.
Furthermore, a node's COMPUTE method may contain additional parallelism. The challenging
task is to write a parallel program that effectively exploits both the dag's inherent parallelism and
the possible parallelism in the COMPUTE method.
This work was done in collaboration with Charles E. Leiserson and Jim Sukha.
SEQ-EVALUATE(B)
  for (D ∈ children(B))
    do if status[D] ≠ COMPUTED
      then SEQ-EVALUATE(D)    ▷ Recursively compute the child
  COMPUTE(B)
  status[B] ← COMPUTED
Figure 4.1: Pseudocode for sequential evaluation of a dag rooted at B. COMPUTE(B) represents the
computation to be done at a node B.
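The sequential traversal of Figure 4.1 translates almost line for line into C++. The sketch below is an illustration only and is not part of DAGEVAL; the Node type, its fields, and the use of std::function for the per-node computation are assumptions made here.

```cpp
#include <functional>
#include <vector>

enum class Status { Unvisited, Computed };

// A dag node for the sequential sketch (hypothetical type).
struct Node {
    std::vector<Node*> children;             // nodes this node depends on
    Status status = Status::Unvisited;
    std::function<void()> compute;           // the work done at this node
};

// Depth-first evaluation: compute every child that has not been computed
// yet, then compute the node itself and mark it as computed.
void seq_evaluate(Node& b) {
    for (Node* d : b.children) {
        if (d->status != Status::Computed) {
            seq_evaluate(*d);                // recursively compute the child
        }
    }
    if (b.compute) b.compute();
    b.status = Status::Computed;
}
```

On a diamond-shaped dag, the shared descendant is computed once, when it is first reached, and is skipped on the second visit because its status is already Computed.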
In order to parallelize a dag evaluation, programmers may design their own scheduler, possibly
based on a task pool [KR03], to manage threads and to execute dag nodes. This approach is
tedious and error-prone, however, since the programmer must explicitly track whether all the
dependencies have been satisfied in addition to managing synchronization and load balancing. Data
structures from existing concurrent libraries may simplify a scheduler for dag evaluations, but may
introduce unnecessary overheads due to their generic interface. For example, one might use a
concurrent queue from the Intel Threading Building Blocks (TBB) library [Rei07] to maintain the list of all
nodes that are ready to execute. Unfortunately, the queue's FIFO ordering introduces unnecessary
dependencies between nodes. Finally, as mentioned above, nodes' COMPUTE functions may also
contain parallelism, and the custom scheduler for dag evaluation must be designed to interoperate
effectively with this other form of parallelism.
On the other hand, programmers might try to code dag evaluations using a concurrency platform,
such as Cilk++ [Art09], which provides its own runtime scheduler. In Cilk++ code, programmers
can spawn a function foo(), thereby specifying that foo() can potentially execute
in parallel with the code immediately after the spawn of foo(). Within a function, programmers
can sync to wait for all prior spawns in the function to complete. Cilk++, like its predecessor Cilk
[BJK+96, FLR98], relies on a work-stealing scheduler for efficient execution. Programmers can
easily express any fork-join parallelism inside the compute functions of individual nodes using
these mechanisms.
However, since the Cilk++ language supports only fork-join parallelism, it is impossible to
encode the dependencies for an arbitrary dag within the control flow of a Cilk++ program without
introducing additional synchronization. For example, one straightforward approach for expressing
arbitrary dag dependencies in a fork-join program is to use locks, as shown in Figure 4.2. This
program resembles the sequential dag evaluation shown in Figure 4.1. Unfortunately, using locks
in this manner limits the scalability of a Cilk++ program in practice. These locks also invalidate
the theoretical bounds on a program's completion time using a Cilk-like work-stealing scheduler.
Ideally, one would like to be able to easily express dag evaluations in a language such as Cilk++,
and take advantage of the efficient runtime schedulers provided by the language.
LOCK-EVALUATE(B)
  acquire(lock(B))    ▷ Prevent other workers from interfering
  for (D ∈ children(B))
    do if status[D] ≠ COMPUTED
      then spawn LOCK-EVALUATE(D)    ▷ Recursively compute children
  COMPUTE(B)
  status[B] ← COMPUTED
  release(lock(B))
Figure 4.2: Dag evaluation in a fork-join parallel language, using locks.
We have designed and implemented a library, called DAGEVAL, for parallel dag evaluations
in Cilk++. Our interface is based on C++ objects and template classes; to represent dag nodes,
the programmer creates an object which is a subtype of a base DAGNode class and overrides the
object's Compute method. We support our interface for dag evaluations purely as a Cilk++ tem-
plate library, without any modifications to the Cilk++ runtime system. As a result, DAGEVAL
easily supports nested spawn statements inside the Compute function of each node. Thus, using
DAGEVAL, it is easy to exploit both the inherent parallelism between the different dag nodes, and
the parallelism inside the COMPUTE function of each dag node. In addition, a similar approach can
be used to create libraries for dag evaluations for other fork-join parallel languages.
In DAGEVAL, we implement a parallel algorithm for dag evaluation using what we call eager
traversal of the dag. In an eager traversal, each worker thread greedily attempts to visit the dag
in a depth-first fashion, recursively trying to visit the children of each node it encounters. If a
child is already being visited when a worker gets to it, the worker does some bookkeeping to
record the fact, and moves on. Intuitively, in an eager traversal, the first worker to reach a node B
expands B, spawning visits to all its children, and the worker who computes the value of the
last uncomputed child of B enables the compute for B.
We evaluate DAGEVAL both theoretically and experimentally. Theoretically, for evaluations
whose dags have nodes with only constant indegree and outdegree, we show that the runtime for eager
traversal is an asymptotically optimal $O(T_1/P + T_\infty)$, where $T_1$ is the sum of the work of the
computes of all the nodes in the dag, and $T_\infty$ is the maximum over all paths p through the dag of
the sum of the spans of the nodes along p. Irregular dynamic programs, i.e., dynamic programs
which require a variable amount of time to compute every cell, represent one class of applications
which fit this paradigm of dag evaluations. Such dynamic programs appear frequently in algorithms
for computational biology (e.g., the Smith-Waterman algorithm [SW81]). To evaluate the
performance of DAGEVAL, we implemented an irregular dynamic program on a 2D grid using
dag evaluations. Our empirical results indicate that when dag nodes are mapped to reasonably
sized blocks of cells, DAGEVAL is competitive with divide-and-conquer implementations of the
same dynamic program. For example, for an N by N grid of cells with N = 4000, and blocks
of 16 by 16 cells, DAGEVAL was less than 4% slower on one processor than a straightforward
divide-and-conquer implementation, but actually more than 9% faster than the divide-and-conquer
approach on 8 processors.
4.2 DAGEVAL: A dag Evaluation Library
We implemented DAGEVAL for dag evaluation in Cilk++. This section describes the library
interface and the implementation of the eager traversal strategy.
Interface
In DAGEVAL, programmers specify dag computations by creating nodes which extend from a
base DAGNode object, specifying the children of each node, and evaluating the dag by invoking
the evaluate method on the root of the dag.
As a concrete example, consider a dynamic program on an n by n grid, which computes the
value M(n, n) based on an input matrix s, and the following recurrence:
$M(i,j) = \max \{\, M(i-1,j) + s(i-1,j),\; M(i,j-1) + s(i,j-1) \,\}$ . (4.1)
Figure 4.3 illustrates how one can express this computation as a dag evaluation. The code
constructs a dag node for every cell M(i, j) by extending from a dag node class. DAGEVAL supports
different types of traversals of the dag, with the type specified as a template parameter of the
DAGNode class. The programmer uses two methods of the base DAGNode class: init_node
initializes each node with a specified key and a pointer to a structure wrapping the computation's
global parameters, and add_child specifies children for a node.
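For reference, the recurrence in Equation (4.1) can also be evaluated serially without any dag library. The sketch below is such a baseline, not DAGEVAL code; the function name solve_grid is hypothetical, indices are 0-based, the matrix s is stored in row-major order as in Figure 4.3, and the base case M(0, 0) = 0 is an assumption, since the recurrence as stated does not give boundary values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Serial evaluation of M(i,j) = max(M(i-1,j) + s(i-1,j), M(i,j-1) + s(i,j-1)).
// s holds the n*n input matrix in row-major order; requires n >= 1.
// Returns M(n-1, n-1) with 0-based indices. (Hypothetical helper.)
long solve_grid(int n, const std::vector<long>& s) {
    std::vector<long> m(static_cast<std::size_t>(n) * n, 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == 0 && j == 0) continue;  // assumed base case M(0,0) = 0
            const int k = n * i + j;         // same cell numbering as Figure 4.3
            bool have = false;
            long best = 0;
            if (i > 0) {                     // predecessor in the previous row
                best = m[k - n] + s[k - n];
                have = true;
            }
            if (j > 0) {                     // predecessor in the previous column
                const long left = m[k - 1] + s[k - 1];
                best = have ? std::max(best, left) : left;
            }
            m[k] = best;
        }
    }
    return m[static_cast<std::size_t>(n) * n - 1];
}
```

Row-major iteration visits each cell after both of its predecessors, which is one valid sequential order of the dag that the library traverses in parallel.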
Implementation
Now we describe the implementation details of DAGEVAL. Once a computation has been ex-
pressed as a dag computation, DAGEVAL provides several traversal options for performing the
evaluation. In this section, we describe our primary strategy, called an eager traversal. Other
traversal options are described in Section 4.4.
Abstractly, an algorithm evaluates a node B by first visiting all of B's children, waiting for those
children to complete, and then calling B's compute method. To evaluate a dag eagerly, DAGEVAL
maintains the following fields for each node:
* Key: A unique 64-bit integer identifier for the node.
* List of Children: Each node has a pointer to an array for the list of children of the node.
* Status: Each node has a status field which changes monotonically, from UNVISITED to
VISITED, then to COMPUTED, and finally to COMPLETED.
* Waiting Parent List: Each node D maintains a list of parent nodes B which must be notified
when D has been computed.
* Join Counter: Each node B maintains a join counter, whose value is equal to the number
of child nodes B is waiting on.
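One possible layout for these fields is sketched below in standard C++. This is an illustration rather than the DAGEVAL data structure: the type EagerNode, the field names, the use of std::mutex and std::vector for the waiting parent list, and the helper try_mark_visited are all assumptions made here. The helper shows how a compare-and-swap lets exactly one worker move a node from UNVISITED to VISITED.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// Status values are ordered so that later stages compare greater.
enum Status : int { UNVISITED = 0, VISITED = 1, COMPUTED = 2, COMPLETED = 3 };

// Per-node bookkeeping for an eager traversal (hypothetical layout).
struct EagerNode {
    std::uint64_t key = 0;                    // unique 64-bit identifier
    std::vector<EagerNode*> children;         // list of children
    std::atomic<int> status{UNVISITED};       // advances monotonically
    std::vector<EagerNode*> waiting_parents;  // parents to notify; guarded by lock
    std::mutex lock;                          // protects waiting_parents
    std::atomic<int> join{0};                 // number of children still waited on
};

// Returns true only for the single caller that changes the status from
// UNVISITED to VISITED; every other caller sees the CAS fail.
bool try_mark_visited(EagerNode& node) {
    int expected = UNVISITED;
    return node.status.compare_exchange_strong(expected, VISITED);
}
```

Because the status values are ordered integers, a test such as "at least VISITED but not yet COMPUTED" becomes a pair of integer comparisons on the loaded value.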
class DPDag {
  int n; int* s; MNode* g;
  DPDag(int n_, int* s_): n(n_), s(s_) {
    g = new MNode[n*n];
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        int k = n*i + j;                          // key of cell (i, j)
        g[k].init_node(k, (void*) this);
        if (i > 0) { g[k].add_child(&g[k-n]); }   // child is cell (i-1, j)
        if (j > 0) { g[k].add_child(&g[k-1]); }   // child is cell (i, j-1)
      }
    }
  }
  // Evaluate the dag from its root, the last cell of the grid.
  void evaluate() { g[n*n - 1].evaluate(); }
};
template <int TraversalType>
class MNode: public DAGNode<TraversalType> {
  int res;
  void Compute() {
    this->res = 0;
    for (int i = 0; i < children.size(); i++) {
      MNode* child = children.get(i);
      int child_val = child->res + s[child->key];
      res = MAX(child_val, res);
    }
  }
};
Figure 4.3: Code using the dag evaluation library to solve the dynamic program in Equation (4.1). This code
constructs a dag node for every cell M(i, j).
For some dag evaluation strategies, some of these fields can be omitted. For example, in a serial
depth-first traversal, nodes do not need to store the waiting parent list or the join counter.
Abstractly, the eager traversal strategy visits all nodes recursively. When a worker is visiting
node A, if A's status is UNVISITED, then the worker tries to atomically change the status to
VISITED. It then tries to visit all of A's children recursively. If the status of a child of A, say B,
is at least VISITED (due to a visit via some other parent of B), but not yet COMPUTED, then the worker
increments A's join counter and adds A to the list of B's waiting parents. After the worker has
spawned all of A's children, it returns, even though A's value may not have been computed yet.
Later, when another worker finishes computing node B, that worker decrements the join counter of
all the nodes in B's waiting parent list. If the join counter for any such node A becomes 0, that worker
enables A and recursively spawns the compute for A. A node's status changes to COMPLETED
when it has notified all its parents in this fashion.
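The role of the join counter can be seen in isolation in a single-threaded simplification. The sketch below is not the eager traversal itself: it registers every parent with every child up front instead of lazily, uses no atomics or locks, and the function name join_counter_order is hypothetical. It shows only the enabling rule, namely that a node is computed when the last of its children finishes and decrements its counter to 0.

```cpp
#include <cstddef>
#include <vector>

// children[v] lists the nodes that node v depends on. Returns an order in
// which the nodes can be computed, driven only by join counters and
// parent lists. (Serial simplification; hypothetical helper.)
std::vector<int> join_counter_order(const std::vector<std::vector<int>>& children) {
    const std::size_t n = children.size();
    std::vector<int> join(n);
    std::vector<std::vector<int>> parents(n);  // stands in for waiting parent lists
    std::vector<int> ready;
    std::vector<int> order;
    for (std::size_t v = 0; v < n; ++v) {
        join[v] = static_cast<int>(children[v].size());
        for (int c : children[v]) parents[c].push_back(static_cast<int>(v));
        if (join[v] == 0) ready.push_back(static_cast<int>(v));
    }
    while (!ready.empty()) {
        const int d = ready.back();
        ready.pop_back();
        order.push_back(d);                    // "compute" node d
        for (int b : parents[d]) {
            if (--join[b] == 0) {              // d was b's last uncomputed child
                ready.push_back(b);            // enable b
            }
        }
    }
    return order;
}
```

In the parallel setting, the decrement becomes an atomic decrement-and-fetch, and pushing onto the ready list becomes a spawn of the parent's compute.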
A more precise implementation is shown in Figure 4.4. In our code, a node B is initialized with its
join counter equal to the number of its children.[1] DAGEVAL attempts to expand each node B by
spawning TRYVISITCHILD on all its children. The TRYVISITCHILD(B, D) method attempts to
atomically change the status of a child D from UNVISITED to VISITED (line 3); if this change
succeeds, we say that B is the expanding parent of D. Then, TRYVISITCHILD checks whether D
is COMPUTED already; if not, then B must be added to D's waiting parent list.[2] Both the addition
of a node to D's waiting parent list (line 9 in TRYVISITCHILD) and the change of D's status to
COMPLETED (line 15 in COMPUTEANDNOTIFY) must be done while holding D's lock in order to
prevent potential race conditions.[3]
The code in COMPUTEANDNOTIFY starting on line 3 handles the notification of parents. While
D is notifying its parents, more nodes B which are parents of D might append themselves to
D's waiting parent list. Each D maintains a notification counter, tracking which elements on
D's waiting parent list the runtime has already spawned notifications for. When D's notification
counter is equal to the current length of its waiting parent list, then the worker changes status[D]
to COMPLETED.
4.3 Analysis of Performance
In this section, we provide a theoretical analysis of the runtime on P processors of parallel dag
evaluation using an eager traversal strategy.
Definitions
Consider a dag evaluation computation expressed by the dag D, with nodes Dv and edges DE.
Conceptually, each node B in Dv has a list of children children(B), and a list of parents par(B).
Let out (B) and in(B) be the out and in-degrees of the node B in D. For simplicity in stating
1The interface we describe here permits this initialization because B's children are known before traversal of the
dag begins. DAGEVAL also supports an interface where B's children are discovered when B is first visited; this
interface requires increments of B's join counter as children are discovered.
2The waiting parent list is implemented as a concurrent dynamic array, optimized for atomic insertions at the end
of the array.
3We can optimize slightly and potentially avoid a lock acquire (not shown in the code) by first checking if
status [D] is already COMPUTED before trying line 7 of TRYVISITCHILD.
/ . _; iii
TRYVISITCHILD(B, D)
1  enabled ← false
2  if status[D] = UNVISITED
3     then enabled ← CAS status[D] from UNVISITED to VISITED
4  if enabled
5     then spawn EXPAND(D)
6  finished ← true
7  lock(D)
8  if status[D] < COMPUTED
9     then add B to waitingParents(D)
10         finished ← false
11 unlock(D)
12 if finished
13    then val ← ATOMDECANDFETCH(join(B))
14         if val = 0
15            then spawn COMPUTEANDNOTIFY(B)

EXPAND(B)
1  assert(status[B] = VISITED)
2  assert(join(B) = |children(B)|)
3  for c ∈ children(B)
4     do spawn TRYVISITCHILD(B, c)

COMPUTEANDNOTIFY(D)
1  COMPUTE(D)
2  status[D] ← COMPUTED
3  n ← SIZE(waitingParents(D))
4  notified(D) ← 0
5  while notified(D) < n do
6     for i ∈ [notified(D), n)
7        do B ← element i of waitingParents(D)
8           val ← ATOMDECANDFETCH(join(B))
9           if val = 0
10             then spawn COMPUTEANDNOTIFY(B)
11    notified(D) ← n
12    lock(D)
13    n ← SIZE(waitingParents(D))
14    if notified(D) = n
15       then status[D] ← COMPLETED
16    unlock(D)
Figure 4.4: Pseudocode for dag evaluation using an eager traversal.
the results, we assume that the dag has a unique node with no incoming edges (represented by
root(D)), and a unique node with no outgoing edges (represented by final(D)).
In addition, we define paths(B, D) as the set of all paths in D from node B to node D.
To analyze the runtime of dag evaluation using eager traversals, we analyze executions of
the program in Figure 4.4. For every node B in D, an eager traversal invokes EXPAND(B) and
COMPUTEANDNOTIFY(B) exactly once; however, the exact work performed by these functions
varies, depending on a particular execution. Each possible execution can be represented as an
execution graph (more specifically, dag) $\mathcal{E}$.
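To make the control flow of Figure 4.4 concrete, the following Python sketch simulates an eager traversal on a single thread. It is only an illustration under stated assumptions, not the DAGEVAL implementation: spawns are modeled by a FIFO task queue, so the lock, the CAS, and the notification counter are unnecessary and are omitted. The names `eager_evaluate`, `children`, and `compute` are hypothetical, and the handling of a node with no children follows the remark in the span analysis that EXPAND on the final node only calls COMPUTEANDNOTIFY.

```python
from collections import deque

UNVISITED, VISITED, COMPUTED, COMPLETED = range(4)

def eager_evaluate(children, root, compute):
    """Serially simulate an eager traversal; return the order of computes.

    children: dict mapping every node to the list of its children.
    compute:  callback invoked once per node, after all its children.
    """
    status = {b: UNVISITED for b in children}
    join = {b: len(cs) for b, cs in children.items()}  # children not yet computed
    waiting = {b: [] for b in children}                # waitingParents lists
    order = []
    tasks = deque()                                    # stands in for spawn

    def compute_and_notify(d):
        compute(d)
        order.append(d)
        status[d] = COMPUTED
        for b in waiting[d]:      # no concurrent appends in a serial run
            join[b] -= 1
            if join[b] == 0:
                tasks.append((compute_and_notify, (b,)))
        status[d] = COMPLETED

    def try_visit_child(b, d):
        if status[d] == UNVISITED:
            status[d] = VISITED
            tasks.append((expand, (d,)))
        if status[d] < COMPUTED:
            waiting[d].append(b)  # d notifies b once it is computed
        else:
            join[b] -= 1          # d is already computed
            if join[b] == 0:
                tasks.append((compute_and_notify, (b,)))

    def expand(b):
        if not children[b]:       # assumed: a childless node computes at once
            tasks.append((compute_and_notify, (b,)))
        for c in children[b]:
            tasks.append((try_visit_child, (b, c)))

    status[root] = VISITED
    tasks.append((expand, (root,)))
    while tasks:
        fn, args = tasks.popleft()
        fn(*args)
    return order
```

On a diamond-shaped dag with root r, children a and b, and a shared grandchild f, this simulation computes f first and r last; each node is computed exactly once even though f is reached along two paths.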
We define several notations for subgraphs of an execution graph $\mathcal{E}$. For a particular execution
graph $\mathcal{E}$ and a dag node B, $\mathrm{exp}^{\mathcal{E}}(B)$ denotes the subgraph of $\mathcal{E}$ corresponding to the call
EXPAND(B), $\mathrm{comNot}^{\mathcal{E}}(B)$ is the subgraph corresponding to the call COMPUTEANDNOTIFY(B),
and $\mathrm{com}^{\mathcal{E}}(B)$ is the subgraph corresponding to COMPUTE(B). For any subgraph $\mathcal{E}'$ of a computation
dag, we denote the work of the subgraph as $W(\mathcal{E}')$, and the critical path length (span) as
$S(\mathcal{E}')$. We overload notation, and when the superscript $\mathcal{E}$ is omitted, we mean the maximum of the
quantity over all execution graphs $\mathcal{E}$; for example, W(com(B)) denotes the maximum work for
COMPUTE(B) over all possible executions $\mathcal{E}$.
To account for costs due to locking and synchronization separately in an execution $\mathcal{E}$, we
define two quantities. Let $L_T(B, D)$ be the time spent inside the TRYVISITCHILD(B, D)
function holding locks or performing the CAS operation and atomic decrements. Let $L_C(B)$ be the
time spent inside COMPUTEANDNOTIFY(B) holding locks or performing atomic decrements. Again, both
quantities represent maximum values over all executions.
Work of dag evaluation
To calculate the work of a dag evaluation, we first construct (pessimistic) bounds on the time an
eager traversal spends waiting at synchronization operations.
Lemma 4.1 For an eager traversal of D,
\[ L_T(B, D) = O(\min\{\mathrm{out}(B) + \mathrm{in}(D),\, P\}) \]
\[ L_C(D) = O\Bigl(\mathrm{in}(D) + \sum_{(B,D) \in D_E} \min\{\mathrm{out}(B),\, P\}\Bigr) \]
PROOF. In TRYVISITCHILD(B, D), we may have contention on the atomic decrement of B's
join counter and on the insertion into waitingParents(D). Each decrement of the join counter
can wait at most min{out(B), P} time and the insertion can wait at most min{in(D), P} time.
Similarly, in COMPUTEANDNOTIFY(D), we may perform an atomic decrement for every edge
$(B, D) \in D_E$ and this decrement may wait at most min{out(B), P} time. When checking the
size field of D's waiting parent list, we may wait for at most all of D's parents. □
Lemma 4.2 The work of an eager traversal of D is
\[ \Bigl(\sum_{B \in D_V} W(\mathrm{com}(B))\Bigr) + O(|D_E|) + O\Bigl(\sum_{(B,D) \in D_E} \min\{\mathrm{out}(B) + \mathrm{in}(D),\, P\}\Bigr) \]
PROOF. The first term arises from the work of the compute functions. The second term bounds
the work of traversing D, assuming no contention. The third term covers the contention cost
(Lemma 4.1). □
Span of dag evaluation
The nondeterministic nature of the computation complicates a direct calculation of S(exp(B)).
Instead, we construct a new, deterministic execution dag $\mathcal{E}^*$, whose span is an upper bound on
the span of EXPAND(root). We define the methods EXPAND*(B), TRYVISITCHILD*(B, D) and
COMPUTEANDNOTIFY*(B) to be the same as the original methods, except that all possible recursive
calls always occur. In other words, EXPAND*(B) always recursively expands every child of
B and COMPUTEANDNOTIFY*(B) always recursively computes every parent of B. Let exp*(B)
and comNot*(B) be the execution subgraphs corresponding to these modified method calls for
B, and let $\mathcal{E}^*$ be the execution dag for EXPAND*(root). Since any execution $\mathcal{E}$ forms a prefix of
$\mathcal{E}^*$, we know $S(\mathrm{exp}^*(B)) \ge S(\mathrm{exp}^{\mathcal{E}}(B))$ and $S(\mathrm{comNot}^*(B)) \ge S(\mathrm{comNot}^{\mathcal{E}}(B))$. Figures 4.5
and 4.6 show execution dags for these modified methods.
Lemma 4.3 bounds S(comNot*(B)) and Lemma 4.4 bounds S(exp*(B)).
Lemma 4.3 S(comNot*(B)) is at most
\[ \max_{p \in \mathrm{paths}(\mathrm{root}, B)} \Bigl\{ \sum_{X \in p} \bigl( O(\mathrm{in}(X)) + L_C(X) + S(\mathrm{com}(X)) \bigr) \Bigr\}. \]
PROOF. From the COMPUTEANDNOTIFY code in Figure 4.4, we see that comNot*(X) will
enable all parents of a node X, with each recursive COMPUTEANDNOTIFY* happening in parallel
(e.g., see Figure 4.5). It is an upper bound to put all synchronization operations on the span of X.
The proof is by induction on the distance of a node X from root in D. In the base case,
S(comNot* (root)) is bounded by fc(root, c) since computation at root makes no recursive
calls.
In the inductive step, assume the lemma holds for all nodes Y which are at a distance at most
k - 1 from root, and consider a node X at a distance k from root.
From Figure 4.5, we can see that for some constant c,
\[ S(\mathrm{comNot}^*(X)) \le S(\mathrm{com}(X)) + c \cdot \mathrm{in}(X) + L_C(X) + \max_{B \in \mathrm{par}(X)} \{ S(\mathrm{comNot}^*(B)) \}. \]
Figure 4.5: An execution dag for COMPUTEANDNOTIFY*(D). Numbers correspond to line numbers from
Figure 4.4. Dashed arrows correspond to operations which require synchronization: either an atomic
decrement, or a locked section. In the worst case for span, all in(D) parents of D are added to the
waiting parent list, and the last parent $B_K$ added has the maximum span of all of D's parents.
Figure 4.6: Execution dag for EXPAND*(B). Dashed arrows correspond to synchronization operations.
Every child Di of B is recursively expanded, and COMPUTEANDNOTIFY* (B) is called for every child Di
The inductive step follows by considering all paths p from root to X, and applying the inductive
hypothesis to the prefix of p without X. □
Lemma 4.4 S(exp*(B)) is at most
\[ S(\mathrm{comNot}^*(\mathrm{final})) + \max_{p \in \mathrm{paths}(B, \mathrm{final})} \Bigl\{ \sum_{X \in p} O(\mathrm{out}(X)) + \sum_{(X,Y) \in p} L_T(X, Y) \Bigr\} \]
PROOF. The proof is by induction on the distance of a node X from final. In the base case, we
know EXPAND*(final) only calls COMPUTEANDNOTIFY*(final).
In the inductive step, suppose for all nodes $Y \in D$ at distance at most k - 1 from final, the
lemma holds. Consider a node X at distance k from final.
From Figure 4.6, we see that for some constant c, S(exp*(X)) satisfies the recurrence
\[ S(\mathrm{exp}^*(X)) \le c \cdot \mathrm{out}(X) + \max_{Y \in \mathrm{children}(X)} \max \bigl\{ S(\mathrm{exp}^*(Y)) + L_T(X, Y),\; S(\mathrm{comNot}^*(X)) + L_T(X, Y) \bigr\} \tag{4.2} \]
Using Lemma 4.3, we know that $S(\mathrm{comNot}^*(\mathrm{final})) \ge S(\mathrm{comNot}^*(X))$ for any $X \in D$.
Then we can substitute the inductive hypothesis for S(exp*(Y)),
bound all COMPUTEANDNOTIFY terms by S(comNot*(final)), and get
\[ S(\mathrm{exp}^*(X)) \le S(\mathrm{comNot}^*(\mathrm{final})) + \max_{Y \in \mathrm{children}(X)} \; \max_{p \in \mathrm{paths}(Y, \mathrm{final})} \bigl\{ f_E(p, c) + c \cdot \mathrm{out}(X) + L_T(X, Y) \bigr\}. \]
Adding X onto the path p and combining terms completes the inductive step. □
Completion time bounds
Using Lemmas 4.2 and 4.4, and the analysis of a Cilk-like work-stealing scheduler [BL99], we
obtain the following bound for the completion time for eager traversal.
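Theorem 4.5 Consider a dag D with maximum degree d and span S(D). With probability at least
$1 - \epsilon$, an eager traversal of D on P processors completes in time
\[ O\bigl(T_1/P + T_\infty + \lg P + \lg(1/\epsilon) + L(D)\bigr), \]
where
\[ T_1 = \Bigl(\sum_{B \in D_V} W(\mathrm{com}(B))\Bigr) + O(|D_E|), \]
\[ T_\infty = \max_{p \in \mathrm{paths}(\mathrm{root}, \mathrm{final})} \sum_{X \in p} S(\mathrm{com}(X)) + O(d\,S(D)), \]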
and
\[ L(D) = O\bigl( (|D_E|/P + d\,S(D)) \min\{d, P\} \bigr). \]
PROOF. The proof follows from Lemmas 4.1, 4.2 and 4.4; we bound the indegrees and outdegrees
of nodes by d, and bound expressions which compute a maximum over paths p in terms of S(D). □
In Theorem 4.5, the terms $T_1$ and $T_\infty$ represent the natural generalizations of work and span
for a dag, taking into account the subcomputations within each node. For example, the first term
in $T_\infty$ is the sum of the spans of all the compute nodes along the longest path in D. The $T_\infty$ term
contains dS(D) instead of just S(D) because each node along the longest path of nodes in D may
scan through its lists of neighbors serially. The L(D) term gives a bound on the contention due
to synchronization during the dag evaluation. The extra factor of min{d, P} appears because we
assume worst-case contention, which hopefully is unlikely to occur in practice. As an important
special case, note that for dags where every node has constant degree, i.e., d = O(1), the term
L(D) is absorbed by $T_1/P + T_\infty$.
4.4 Experimental Setup
In this section, we describe our experimental setup for comparing eager traversal with other traver-
sals for dag evaluation. First we describe the other dag traversals and then describe other details.
Traversals for dag evaluation
DAGEVAL, described in Section 4.2, supports three different parallel dag traversals - eager,
blocking and helping traversals. We compare these three traversals for our benchmarks.
An eager traversal is the strategy discussed in Section 4.2, and is the default option for DAGEVAL.
A blocking traversal visits each node B by recursively spawning visits to B's children. When
a worker successfully marks B as VISITED, it first tries to recursively visit any of B's children
which are UNVISITED. Then, if all of B's children are at least VISITED, the worker blocks
until B's status is COMPUTED. Blocking traversal is similar to LOCK-EVALUATE shown in
Figure 4.2.
A helping traversal is similar to a blocking traversal; however, when a worker encounters
a node B whose children are all at least VISITED, it picks one of B's children which is not yet
COMPUTED, and recursively tries to help visit that child. In this approach, a node may be
"visited" by multiple workers, but additional synchronization guarantees that the compute of every
node happens exactly once.
Blocking and helping traversals were implemented in a straightforward fashion, using atomic
updates and spin-waiting on the status fields for synchronization. Note that for blocking and help-
ing traversals, a visit of a node B never returns until the value of B has been computed. If the
programmer mistakenly creates graph D with a cycle, blocking traversal may deadlock, helping
traversal may enter an infinite loop, and an eager traversal may terminate without computing all the
nodes.
Machine configuration
We ran our experiments on two multicore machines, with 8 and 16 total cores, respectively. Our
first machine is a two-socket, quad-core (3.16 GHz Intel Xeon X5460) machine with 8
GB RAM. Each processor has 6 MB of cache, shared among the four cores, and a 1333 MHz FSB.
The second machine is a four-socket, quad-core (2.40 GHz Intel E7340) machine with 16
GB RAM. Each processor has 4 MB of shared cache, and a 1066 MHz FSB. Both machines run a
version of Debian 4.0, modified for MIT CSAIL, with Linux kernel version 2.6.18.8. All code was
compiled using the Cilk++ compiler (based on GCC 4.2.4) with optimization flag -O2.
4.5 Dynamic Programming Application
Our primary test application is a dynamic programming computation on a 2D grid. In particular,
we consider the dynamic program which computes a value M(i, j) based on the following set of
recursive equations:
\[ E(i,j) = \max_{k \in \{0,1,\ldots,i-1\}} M(k,j) + \gamma(i-k) \]
\[ F(i,j) = \max_{k \in \{0,1,\ldots,j-1\}} M(i,k) + \gamma(j-k) \]
\[ M(i,j) = \max \bigl\{ M(i-1,j-1) + s(i,j),\; E(i,j),\; F(i,j) \bigr\} \tag{4.3} \]
The functions s(i, j) and $\gamma(z)$ can be computed in constant time; in our actual implementation, we
look up values for s and $\gamma$ from tables in memory. This dynamic program is irregular because the
work for computing the cells is not the same for each cell; O(i + j) work must be done to compute
M(i, j). Therefore, in total, computing M(m, n) using Equation (4.3) requires $\Theta(mn(m + n))$
work ($\Theta(n^3)$ when m = n). As described in [LLS07], this particular dynamic program models the
computation used for the Smith-Waterman [SW81] algorithm with a general penalty gap function.
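As a reference point for the recurrences in Equation (4.3), the following Python sketch fills the table serially in row-major order. It is a minimal illustration, not the implementation used in our experiments. The boundary condition (row 0 and column 0 fixed at 0) is an assumption, since the equations above do not state one, and the names `compute_m`, `s`, and `gamma` are hypothetical.

```python
def compute_m(rows, cols, s, gamma):
    """Serially evaluate Equation (4.3) for cells (1..rows, 1..cols).

    s(i, j) and gamma(z) are caller-supplied constant-time functions.
    Assumption: M(0, j) = M(i, 0) = 0 on the boundary.
    """
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            # E(i, j): scan column j above the cell, O(i) work.
            e = max(M[k][j] + gamma(i - k) for k in range(i))
            # F(i, j): scan row i to the left of the cell, O(j) work.
            f = max(M[i][k] + gamma(j - k) for k in range(j))
            M[i][j] = max(M[i - 1][j - 1] + s(i, j), e, f)
    return M
```

The two inner scans are what make the work at cell (i, j) proportional to i + j, which is the source of the irregularity discussed above.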
4.5.1 Parallel Algorithms
We explored two types of parallel algorithms for this dynamic program. One type is based on dag
evaluations, and the other uses divide-and-conquer techniques.
DP as a dag evaluation
We can express the dynamic program in Equation (4.3) as a dag evaluation by creating a dag D
similar to the code in Figure 4.3.4 In order to improve cache-locality and amortize the overheads
of dag nodes, however, we block the grid so that every dag node represents a B by B block of cells
(instead of every node representing a single cell as in Figure 4.3). Block (bi, bj) represents the
block with upper left corner at cell (biB, bjB). In the dag, block (bi, bj) depends on (at most) two
4Although M(i,j) depends on the entire row i to the left of the cell and the entire column j above the cell,
when creating a dag, it is sufficient to create dependencies to M(i, j) only from M(i - 1, j) and M(i, j - 1); other
dependencies are ensured by transitivity.
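// Computes M for the whole n by n grid.
ComputeM(n) { ComputeMHelper(0, 0, n); }

// Computes M for an n by n subgrid
// with upper left corner at (i, j).
ComputeMHelper(i, j, n) {
  if (n <= B) { ComputeMBase(i, j, n); }
  else {
    // The upper-left quadrant must finish first.
    ComputeMHelper(i, j, n/2);
    // The two antidiagonal quadrants can run in parallel.
    cilk_spawn ComputeMHelper(i+n/2, j, n/2);
    cilk_spawn ComputeMHelper(i, j+n/2, n/2);
    cilk_sync;
    // The lower-right quadrant depends on the other three.
    ComputeMHelper(i+n/2, j+n/2, n/2);
  }
}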
Figure 4.7: Pseudocode for a parallel divide-and-conquer algorithm to compute M(n, n) for the dynamic
program in Equation (4.3). For simplicity, we only show the code when m = n is a power of 2.
blocks (bi - 1, bj) and (bi, bj - 1). The compute method for each node computes the values of M
for the entire block sequentially. Theorem 4.6 analyzes the span of this computation.
Theorem 4.6 For Equation (4.3), if the size of the matrix is n and the block size is B, an eager
traversal of D has a span of $O(n^2 B)$, assuming n > B.
PROOF. The span of D consists of $\Theta(n/B)$ blocks, with at least half the blocks requiring $\Theta(nB^2)$
work. The result follows from the $T_\infty$ term in Theorem 4.5. □
DP as a divide and conquer program
We can also devise a divide-and-conquer algorithm for the dynamic program, as shown in Fig-
ure 4.7. This algorithm divides the grid into 4 sub-grids, and then computes the cells in each sub-
grid recursively. The computations for the two sub-grids along the antidiagonal can be performed
in parallel. We also block the divide and conquer implementation for locality and to provide a large
enough base case to overcome overheads. Theorem 4.9 computes the span of this algorithm (proof
omitted due to space constraints).
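Before the formal analysis, a rough count shows why the recursive structure lengthens the critical path. The Python sketch below counts only the number of base-case blocks that must execute one after another, ignoring the fact that blocks further from the origin cost more; it is an illustration under that simplifying assumption, and the function names are hypothetical. For the divide-and-conquer algorithm, each level runs one quadrant, then two in parallel, then one more, so the count triples per level. For the blocked dag, the longest chain of dependent blocks runs from one corner of the block grid to the opposite corner.

```python
def dnc_critical_blocks(n, b):
    """Base-case blocks on the critical path of the Figure 4.7 recursion."""
    if n <= b:
        return 1
    c = dnc_critical_blocks(n // 2, b)
    # first quadrant, then two parallel quadrants, then the last quadrant
    return c + max(c, c) + c

def dag_critical_blocks(n, b):
    """Blocks on the longest dependency chain of the (n/b) x (n/b) block dag."""
    return 2 * (n // b) - 1
```

For n = 1024 and B = 16, the recursion places 3^6 = 729 blocks on its critical path, while the dag places only 127.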
To compute bounds on the span of the divide-and-conquer algorithm, let S(i, j, m, n) denote
the span of the ComputeMHelper method, generalized from Figure 4.7 to an m by n subgrid. We
look at the recursion tree (i.e., call tree) generated when calling ComputeMHelper(0, 0, m, n).
Every node x in this tree represents some call to ComputeMHelper(i, j, m, n); an x which is an
internal node has either 2 or 4 children (corresponding to recursive calls to ComputeMHelper),
while an x which is a leaf calls ComputeMBase. For shorthand, we let S(x) denote the span
corresponding to x's ComputeMHelper call.
To compute an upper bound on S(x), we assign every x in the recursion tree a height h(x),
which is 0 if x is a leaf, or 1 plus the maximum of the heights of its children if x is an internal node.
We claim that S(x) is upper-bounded by the following expression.
Lemma 4.7 Let x be a node in the recursion tree corresponding to a call ComputeMHelper(i,
j, m, n). Then for some constant c, S(x) satisfies
\[ S(x) \le c \cdot 3^{h(x)} \bigl( B^3 \cdot 2^{h(x)} + B^2 (i + j) \bigr). \]
PROOF. The proof is by induction on the height of the recursion tree. In the base case, any x
which is a leaf in the recursion tree has height h(x) = 0, and corresponds to some base-case call
ComputeMBase(i, j, m, n). From the code in Figure 4.7, in this call, we know $m \le B$
and $n \le B$. We know the work for the base case is at most
\[ T(i, j, m, n) \le c \cdot mn \Bigl( \tfrac{m + n}{2} + (i + j) \Bigr) \le c \bigl( B^3 + B^2 (i + j) \bigr). \]
This expression matches exactly the formula above with h(x) = 0.
In the inductive step, suppose for all nodes y with h(y) < k, the lemma on S(y) holds. Consider
a node x at height h(x) = k, corresponding to a call ComputeMHelper(i, j, m, n). Since
k > 0, x must have either 2 or 4 children. First, consider the case where x has 4 children. Denote
these children as $x_{00}$, $x_{10}$, $x_{01}$, and $x_{11}$, corresponding to the recursive subproblems in Figure 4.7.
The spans of these children satisfy the following relationships:
\[ S(x_{00}) = S(i, j, m/2, n/2) \]
\[ S(x_{10}) = S(i + m/2, j, m - m/2, n/2) \]
\[ S(x_{01}) = S(i, j + n/2, m/2, n - n/2) \tag{4.4} \]
\[ S(x_{11}) = S(i + m/2, j + n/2, m - m/2, n - n/2) \]
\[ S(x) = S(x_{00}) + \max(S(x_{10}), S(x_{01})) + S(x_{11}) \]
Each of the 4 children has height at most k - 1; thus, by applying the inductive hypothesis and
collecting terms which are independent of m and n, we have
\[ S(x) \le 3c \cdot 3^{k-1} \bigl( B^3 \cdot 2^{k-1} + B^2 (i + j) \bigr) \]
\[ \qquad + \; c \cdot 3^{k-1} B^2 \Bigl( \frac{m}{2} + \frac{n}{2} \Bigr) \]
\[ \qquad + \; c \cdot 3^{k-1} B^2 \max\Bigl( \frac{m}{2}, \frac{n}{2} \Bigr) \tag{4.5} \]
For any node x at height k, it is straightforward to show that $m \le 2^k B$ and $n \le 2^k B$, since as
we walk up from the leaves of the recursion tree, each dimension can at most double. Substituting
these bounds into Equation (4.5), we can bound the sum of the last two lines by $c \cdot 3^k B^3 \cdot 2^{k-1}$.
Adding all these terms together proves the inductive hypothesis for node x at height k. □
To compute a lower bound on the span, for simplicity, we consider only the prefix of the recursion
tree which is the maximum number of levels of the tree which still forms a complete 4-ary
tree. For each node x in the prefix recursion tree, let g(x) denote the height of x (with leaves having
height 0). We know the root of the tree must have height at least $\lg(\min\{m, n\}/B)$; otherwise, a base
case could not have been reached, and the recursion would divide into 4 subproblems for at least
one more level. Also, any x in the prefix recursion tree at height g(x) must correspond to a problem
with $m \ge 2^{g(x)} B/2$ and $n \ge 2^{g(x)} B/2$.
From these results, we can prove a lower bound for S(x) in terms of g(x), in a fashion analogous
to Lemma 4.7.
Lemma 4.8 Let x be a node in the complete prefix of the recursion tree, with height g(x) in this
prefix tree. Suppose x corresponds to a call ComputeMHelper (i , j, m, n). Then for some
constant c, S(x) satisfies
S() 3(x )  29(x) + B2
PROOF. The proof is by induction on the height g(x). In the case where g(x) = 0, x must
correspond to a subproblem of size at least $m \ge B/2$ and $n \ge B/2$. (If the problem were any smaller,
then x's parent would have executed a base case.) This subproblem has span which must be at
least the time to execute a block of the same size serially, i.e., at least T(i, j, B/2, B/2). Setting
g(x) = 0 matches the work for T(i, j, B/2, B/2).
In the inductive step, suppose for all nodes y with g(y) < k, the lemma on S(y) holds. Consider
a node x at height g(x) = k, corresponding to a call ComputeMHelper (i, j , m, n). For
the recursion, Equation (4.4) still holds; in place of Equation (4.5), however, we have:
S(x) 3ck-(B3 2k-1 2(i+j)) (4.6)
+ g .3 k-]B2 (m + a + max (m n))4 - 3 .l 2 2 2 I 2
Since we know at height k, $m \ge 2^k B/2$ and $n \ge 2^k B/2$, we substitute into Equation (4.6) to prove the
inductive hypothesis. □
Theorem 4.9 The span of the divide-and-conquer algorithm in Figure 4.7 is $\Theta(n^{\lg 6} B^{3 - \lg 6}) \approx
O(n^{2.585} B^{0.415})$, where n is the matrix size and B is the block size.
Theorems 4.6 and 4.9 indicate that the parallelism (i.e., work divided by span) of an eager
traversal dag evaluation is O(n/B), but only about $\Theta((n/B)^{0.415})$ for the divide-and-conquer
algorithm. One can asymptotically decrease the span of a divide-and-conquer algorithm by dividing M
into more subproblems, but the code becomes more complex. In the limit, the resulting parallelism
of divide-and-conquer would also approach $\Theta(n/B)$.
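To give a sense of scale for these two parallelism expressions, the short Python sketch below evaluates them for a concrete problem size. It drops all constant factors hidden by the asymptotic notation, so the numbers are only indicative, and the function names are hypothetical.

```python
import math

def dag_parallelism(n, b):
    """Asymptotic parallelism n/B of the blocked dag evaluation."""
    return n / b

def dnc_parallelism(n, b):
    """Asymptotic parallelism (n/B)^(3 - lg 6) of the 4-way divide and conquer."""
    return (n / b) ** (3 - math.log2(6))
```

With n = 1000 and B = 16, the first expression gives 62.5, while the second gives a value between 5 and 6, so on a 16-core machine only the divide-and-conquer algorithm would be expected to run short of parallelism.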
Implementation
In our experiments, we compared four parallel implementations of Equation (4.3): the three dag
traversals (eager, helping and blocking) and the divide-and-conquer algorithm shown in Figure 4.7.
For a fair comparison, all implementations use the same memory layout, and reuse the same code
for core methods, e.g., computing a single B by B block. For all our implementations, when
computing at a cell M(i, j), we do not parallelize the computation of E(i, j) and F(i, j); instead,
we loop over the prefix of row i and over column j serially. For problems with N < 1000, we found
that using Cilk++ parallel constructs in this loop introduced noticeable overhead.
Problems with N > 1000 tend to already have sufficient parallelism, at least for the number of
processors on our test machines.
Since memory layout impacts performance significantly for large problem sizes, we stored both
M(i, j) and s(i, j) in a cache-oblivious [FLPR99] layout. The computations of E(i, j) and F(i, j)
require scanning along a column and row, respectively; thus, simply storing M in a row-major
or column-major layout would be suboptimal for one of the computations.5 To support efficient
iteration over rows and columns, we use dilated integers as indices into the grid [WF99], and
techniques for fast conversion between dilated and normal integers from [RW08].
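The idea behind dilated integers can be sketched in a few lines of Python. This is the generic bit-interleaving technique for 16-bit coordinates, written as an illustration; it is not claimed to match the specific routines of [WF99] or [RW08], and all names below are hypothetical. A dilated value keeps its bits in the even positions, so the row and column coordinates can be interleaved into one index, and a coordinate can be incremented without converting back to a normal integer.

```python
EVEN_BITS = 0x55555555  # positions 0, 2, 4, ... hold a dilated 16-bit value
ODD_BITS = 0xAAAAAAAA   # the gaps between them

def dilate(x):
    """Spread the low 16 bits of x so that bit t moves to bit 2t."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def undilate(d):
    """Inverse of dilate: gather the even-position bits back together."""
    d &= 0x55555555
    d = (d | (d >> 1)) & 0x33333333
    d = (d | (d >> 2)) & 0x0F0F0F0F
    d = (d | (d >> 4)) & 0x00FF00FF
    d = (d | (d >> 8)) & 0x0000FFFF
    return d

def dilated_increment(d):
    """Return dilate(x + 1) given d = dilate(x); filling the gaps with 1s
    lets the carry ripple across them."""
    return ((d | ODD_BITS) + 1) & EVEN_BITS

def morton_index(i, j):
    """Interleave row i (odd bit positions) and column j (even positions)."""
    return (dilate(i) << 1) | dilate(j)

def step_right(index):
    """Index of cell (i, j + 1), given the index of cell (i, j)."""
    return (index & ODD_BITS) | dilated_increment(index & EVEN_BITS)
```

Stepping along a row or a column then costs a few bitwise operations per cell, which is what makes serial scans over rows and columns practical in an interleaved layout.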
Experimental results
We ran two different types of experiments on our implementations of the dynamic program. The
first experiment measures the parallel speedup of the four different techniques for various problem
sizes N. The second experiment measures the sensitivity of the eager dag evaluation and the divide-
and-conquer approaches to different choices in block size B.
Speedup of various techniques
In this experiment, we compare the speedup provided by the three dag traversal strategies and
divide-and-conquer algorithm. For this experiment, we fix the block size at B = 16; each dag
node is responsible for computing a 16 by 16 block of the original grid, and the divide-and-conquer
algorithm recurses down to blocks of size 16 by 16.
Figures 4.8, 4.9 and 4.10 show the speedup on P processors for $N \in \{1000, 5000, 15000\}$
on our second machine.6 All the approaches run in virtually the same time on P = 1; the dag
evaluation using an eager traversal has a greater speedup than the divide-and-conquer version,
however. For example, at N = 1000, the divide-and-conquer algorithm achieves a speedup of less
than 5 on 16 processors, while an eager traversal exhibits a speedup of about 14. This result is
not surprising, since the dag evaluation has a higher asymptotic parallelism than the divide-and-
conquer algorithm. As N increases to 5000, both the eager traversal and the divide-and-conquer
algorithm improve in scalability. Locking and helping traversals, however, both appear to achieve
a speedup of no more than 2.
As N increases even more, however, the speedup starts to level off, and eventually decreases.
We conjecture that this slowdown is due to increased data bus traffic and a lack of locality when
computing the terms E(i, j) and F(i, j). In Equation (4.3), if we replace the $\gamma$ term with indices
which are independent of k, then we see a significant improvement in speedup on N = 15000.
Effect of block size
To measure the sensitivity of eager traversal to block size, we fix N and vary B. Figure 4.11 shows
the results for N = 4000. For small block sizes, we see that eager traversal performs worse than
divide and conquer due to overheads. At small block sizes, in addition to having large computation
overhead for each dag node, DAGEVAL also has significant space overhead. Dag nodes are 72-byte
objects (plus pointers to memory that stores children and parent lists). This overhead is significant
if each node only represents a small block which contains only a few integers. As the block size
increases, however, at P = 1, the runtime for eager traversal approaches the runtime for divide and
conquer, and it scales better than divide and conquer for B > 16. 7
5As a point of comparison, the divide-and-conquer algorithm for N = 2000 took about 300 s using a Morton-order
layout, but 460 s using a row-major layout.
6Experiments on our first machine show similar results.
7Preliminary results show that dividing the grid into 25 subproblems instead of 4 narrows, but does not eliminate,
this gap.
[Plot: Dynamic Program, Intel Xeon E7430: N = 1000, B = 16, Speedup vs. P]
Figure 4.8: Dynamic program on an N by N grid (N = 1000 and B = 16). Speedup is normalized to the
fastest run with P = 1 (3.52 s for divide-and-conquer). Results are from the second machine, with 16 cores
total.
[Plot: Dynamic Program, Intel Xeon E7430: N = 5000, B = 16, Speedup vs. P]
Figure 4.9: N = 5000 and B = 16. Speedup is normalized to the fastest run with P = 1 (442 s for
divide-and-conquer).
[Plot: Dynamic Program, Intel Xeon E7430: N = 15000, B = 16, Speedup vs. P. Series: Eager Traversal, Divide and Conquer, Locking Traversal, Helping Traversal.]
Figure 4.10: N = 15000 and B = 16. Speedup is normalized to the fastest run with P = 1 (12,376 s for
helping traversal).
[Plot: O(N^3) Dynamic Program: N = 4000, Time vs. Block Size B. Bars: D+C and Eager for B = 1, 2, 4, 8, 16, 32.]
Figure 4.11: Running time for the $O(N^3)$ dynamic program for N = 4000, varying block size for base case B.
Results are from the first machine.
We also modified the dynamic program in Equation (4.3) to perform only O(1) work at cell
(i, j) instead of O(i + j) work. Our results (not shown) indicate that this modified version requires
larger block sizes (e.g., $B \ge 32$) before the eager traversal outperforms the divide-and-conquer
algorithm on 8 processors. Again, this result is not surprising; since each node does less work for
a particular block size, we require larger block sizes to overcome the overheads.
In summary, our experiments indicate that although dag evaluations may suffer from high over-
heads when each node does very little work, in general for this dynamic program, an eager traversal
has relatively small overheads and is competitive with a divide-and-conquer implementation for
reasonably sized blocks.
4.6 Random Dag Microbenchmark
To assess the overhead of the dag evaluation library and to understand its performance on an irregu-
lar application, we constructed a microbenchmark which evaluates randomly constructed dags. We
generate random dags D based on three parameters: d - the maximum outdegree of each node, U -
the size of the universe from which keys are chosen, and W - the work in the compute of each dag
node. D has a single root node $B_0$ with key 0. Then, iterating over k from 0 to U - 1, we repeat
the following process:
* If D has a node $B_k$, choose an integer $d_k$ uniformly at random from the interval [1, d].
* Create a multiset $S_k$ of $d_k$ integers, with each element chosen uniformly at random from
[k + 1, U].
* Remove any duplicates from $S_k$, and for all $k' \in S_k$, add edges $(B_k, B_{k'})$ to the dag (creating
$B_{k'}$ if it doesn't already exist).
In D, each dag node $B_k$ performs W work, computing $k^W \bmod p$ using repeated multiplication (p is
a fixed 32-bit prime number). The benchmark provides the option of either performing this work
serially, or in parallel (dividing the work in half, spawning each half, and recursing down to a base
case of W = 25).
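The construction above can be sketched in Python as follows. This is an illustrative reading of the three steps, not the benchmark's actual code: the function names and the `seed` parameter are hypothetical, and the prime `p` is an assumed stand-in (the largest prime below 2^32), since the text does not give the value used.

```python
import random

def random_dag(d, U, seed=0):
    """Return {key: sorted list of child keys} built by the steps above."""
    rng = random.Random(seed)
    children = {0: []}                 # single root B_0 with key 0
    for k in range(U):                 # k = 0, 1, ..., U - 1
        if k not in children:
            continue                   # B_k was never created
        dk = rng.randint(1, d)         # uniform on [1, d]
        # A set drops duplicate picks from [k + 1, U].
        targets = {rng.randint(k + 1, U) for _ in range(dk)}
        children[k] = sorted(targets)
        for t in targets:
            children.setdefault(t, []) # create B_t on first reference
    return children

def node_work(k, W, p=4294967291):
    """Compute k^W mod p with W repeated multiplications (serial version)."""
    result = 1
    for _ in range(W):
        result = (result * k) % p
    return result
```

Because every edge goes from a smaller key to a larger one, the result is acyclic by construction, and the number of nodes actually created is typically much smaller than U.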
Experiments
We use the random dag benchmark in three experiments: (1) to measure the overhead of perform-
ing parallel dag evaluation, (2) to compare the various traversal strategies, and (3) to evaluate the
benefits of allowing parallelism inside the computes of nodes.
To measure the approximate overhead for manipulating dag node objects, and for parallel book-
keeping, we construct a medium-sized random dag and vary W. For W = 1, Figure 4.12 shows
that the overhead in our synthetic benchmark can be on the order of thousands of cycles per node,
or hundreds of cycles per edge. We observe that each node generally requires a W on the order of
at least 1,000 to 10,000, before the overhead per node no longer dominates the cost of computation.
In Figure 4.12, the overhead of bookkeeping for eager traversal was a factor of 2 to 3
over the serial traversal when W = 1. As W increases, however, the differences become negligible.
To compare the eager traversal to other traversals, we first created a large random dag, with very
little work per node, i.e., W = 1. For eager, blocking, and helping traversals, Figure 4.13 shows the
          Eager Traversal                          Serial Traversal
W       T (s)   Cycles/(W|V|)   Cycles/(W|E|)   T (s)   Cycles/(W|V|)   Cycles/(W|E|)
1       0.016   3202            583             0.007   1401            255
10      0.017   340             62.0            0.009   180             32.8
100     0.029   58.0            10.6            0.021   42.0            7.66
1000    0.153   30.6            5.58            0.144   28.8            5.25
10^4    1.379   27.6            5.03            1.370   27.4            5.00
10^5    13.64   27.3            4.97            13.63   27.3            4.97
Figure 4.12: Evaluation of D with |V| = 15489 and |E| = 85012, for P = 1. D was randomly generated
with d = 10, U = 150000 and W = 1.
speedup on P processors over the serial traversal. Eager traversal provides only limited speedup (at
most 2.5) compared to blocking and helping traversals, since eager traversal uses larger dag nodes
and has more overhead for bookkeeping. Also, the locking and helping traversals of a randomly
constructed dag spend relatively little time spin-waiting (compared to the grid in Section 4.5), since
paths through the dag are not likely to overlap.
On the other hand, we can see from Figure 4.14, when each node has a substantial amount of
work to do, then the eager traversal is more scalable than both the blocking and the helping traver-
sals. In this case, the dag has relatively few nodes (only 128). Therefore, if we look at the version
where each node is computed sequentially, the theoretical parallelism is only about 127/29 = 4.37.
The eager traversal exploits most of this available parallelism, reaching a maximum speedup of
more than 4.2. In contrast, the blocking and the helping versions provide a speedup of about 3.6.
More importantly, however, Figure 4.14 demonstrates that to attain the best performance, one
needs to exploit parallelism both at the dag level and in the COMPUTE functions. The eager traversal
with parallelism inside COMPUTE functions shows the best speedup. Parallelism within nodes also
improves the blocking and helping traversals slightly, but not as much as the eager traversal. In fact,
a serial dag traversal with parallelism inside nodes outperforms blocking and helping traversals.
4.7 Future Work
We have described a library, DAGEVAL, which implements eager traversal for parallel dag eval-
uation in Cilk++. DAGEVAL allows programmers to exploit both dag-level parallelism and par-
allelism within the COMPUTE functions. We would like to extend the library to support a more
dynamic interface. In this chapter, we saw an interface which assumes the children of a node are
known before traversal of the dag begins. In fact, DAGEVAL contains mechanisms for allowing
children of a node to be discovered on the first visit to a node. One might further extend this in-
terface, however, and allow children of a node B to be added dynamically, after computing the
results of some of B's other children. One direction for future work is to explore what the interface
for specifying such dynamic dags might be, and whether one can support the interface with low
overhead. We would also like to explore other types of applications where using dag evaluations
might simplify parallel programming or improve program performance. Finally, from our dynamic
programming benchmark, we see that the performance of a dag evaluation may be limited by local-
ity and memory bandwidth issues. It is an interesting research question to understand whether one
[Plot: Random DAG: (|V|, |E|) = (591734, 3255952), Span = 147, W = 1, Speedup vs. P]
Figure 4.13: Comparison of eager, helping, and locking traversals for a large random dag with W = 1.
Speedup is normalized over time for serial traversal when P = 1 (0.65 s).
[Plot: Random DAG: (|V|, |E|) = (127, 614), Span = 29, W = 1000000, Speedup vs. P. Series: Eager, Locking, Helping, and Serial traversals, each with Parallel Compute and Serial Compute.]
Figure 4.14: Comparison of eager, helping, and blocking traversals, with and without parallelism in the
COMPUTE function. Speedup is normalized over time for serial traversal with serial compute, for P = 1
(1.12 s).
Eager, Serial Compute -
I nr-n Serial nm "n --- -
U, Vp1Helping, Serial Compute ---- ----
Serial, Serial Compute u. .. /
/
/
/ t
/0
• -- "... . ..t "................... E - " " ........ .... .. ---.... .... ... ...
I I I I I I
2
2.5
1.5
t
can take advantage of locality in a dag evaluation, particularly for dags with an irregular structure.

Chapter 5
Region Helper Locks
As mentioned in Chapter 1, dynamic multithreaded languages often allow for efficient schedul-
ing. In particular, work-stealing provides asymptotically optimal completion time and good space
bounds for programs with fork-join parallelism. Unfortunately, most of these theoretical bounds
do not hold when programs have locks. Consider an example of a concurrent hash table, which is
resized when it gets too full. Normally, two inserts, say G1 and G2, can run in parallel, assuming
G1 and G2 access different buckets. However, if G1 triggers a task H which resizes the table, then
G1 and G2 should not run concurrently. This property is typically enforced by using locks. Us-
ing locks to protect large critical sections often compromises scalability, however. For example, if
worker p is performing a resize operation, all processors that wish to perform inserts must wait for
the resize to finish.
Since the resize H is an expensive operation, one would like to execute it in parallel. In addition,
since the worker p executing an insert G2 must already wait for H to finish, one might want p to
"help" execute the parallel work within H. In this chapter, we introduce helper locks into a fork-
join parallel programming language. We add two types of helper locks, namely region helper
locks and short helper locks. A region helper lock protects a large critical section, called a parallel
region, which contains nested parallelism. In the hash table example, the resize operation should
be enclosed in a region helper lock. A short helper lock is like an ordinary lock which normally
protects a short serial critical section, but is linked to a region helper lock. In our design, when a
worker p blocks while trying to acquire a helper lock $\ell$, and some parallel region A holds either
$\ell$ or the region helper lock linked to $\ell$, p can help complete the region A. More specifically, our
contributions are the following:
1. The parallel region lock runtime (PRL), which can execute series-parallel computations
augmented with helper locks and parallel regions. While a parallel region is active, multiple
workers may enter and help complete the region, either because they blocked on the region's
helper lock, or due to random work stealing. PRL supports nested regions of unbounded
nesting depth.
2. Theoretical bounds on the completion time and stack space usage of computations executed
using PRL.
3. Language and runtime support for helper locks in MIT Cilk. The PRL design extends the
Cilk language to allow users to spawn a function as a parallel region, protected by a region
helper lock. The runtime system extension uses "deque pools" and "deque chains" to support
parallel regions.
This work was done in collaboration with Charles E. Leiserson and Jim Sukha.
In our theoretical work, we extend the results given in [ABP98, BL99] for work-stealing schedulers,
and show that PRL completes a "deadlock free" computation (with region helper locks) on
P processors in time $O(T_1/P + \widetilde{T}_\infty + PN + \lg(1/\epsilon))$ with probability at least
$1 - \epsilon$, where $T_1$ is the work of the computation, N is the number of regions, and
$\widetilde{T}_\infty$ is an "aggregate span" which is bounded by the sum of spans of all regions. Our
completion time bounds are asymptotically optimal for certain computations with parallel regions and
helper locks. In addition, the bounds imply that PRL provides linear speedup if all parallel regions
in the computation are sufficiently parallel. Roughly, if for all regions A, the non-nested work of
region A is asymptotically larger than P times the span of A, then PRL executes the computation with
linear speedup. We also show that PRL completes the computation using only $O(P\widetilde{S}_1)$ stack
space, where $\widetilde{S}_1$ is the sum over all regions A of the stack space used by A in a serial
execution of the same computation.
As a proof-of-concept, we implemented a prototype of PRL in MIT Cilk, and used the system
to implement a concurrent hash table which uses helper locks to protect resize operations. We
performed some experiments which demonstrate that implementation of PRL is feasible.
The rest of the chapter is organized as follows. In Section 5.1, we motivate the usefulness of
helper locks by discussing the resizable hash table example in greater detail, and we explain the
shortcomings of existing languages such as MIT Cilk for this example. Section 5.2 presents our
design for a work-stealing scheduler which supports parallel regions and helper locks. Section 5.3
proves theoretical bounds on the completion time and space usage for our design. In Section 5.4, we
describe our prototype implementation of parallel regions and helper locks in MIT Cilk. Section 5.5
presents some experimental results on our prototype system for a simple concurrent hash table
benchmark. Section 5.6 discusses related work.
5.1 Motivating Example
In this section, we motivate the utility of region helper locks in a dynamically multithreaded system
by looking at example code for a resizable hash table. In particular, we present code written using
MIT Cilk, an open-source fork-join parallel programming language. First, we briefly review key
features of the Cilk language and runtime. Then, we discuss the challenges of using Cilk with
ordinary locks to exploit parallelism within a hash table's resize operation.
Overview of MIT Cilk
We review the characteristics of MIT Cilk and its work-stealing scheduler using the sample program
in Figure 5.1. This pseudocode shows a Cilk function that concurrently inserts n random keys into
a resizable hash table.
Cilk extends C with two main keywords: spawn and sync. In Cilk, the spawn keyword
specifies that a function can potentially execute in parallel with the continuation of the function,
i.e., code which immediately follows the spawn statement. The sync keyword forces any code
after the sync to execute only after all previous spawned functions in the function have completed.
1 cilk void rand_inserts(HashTable* H, int n) {
2   if (n <= 32) { rand_inserts_serial(H, n); }  /* small case: run serially */
3   else {
4     spawn rand_inserts(H, n/2);                /* the two halves may run in parallel */
5     spawn rand_inserts(H, n-n/2);
6     sync;                                      /* wait for both halves */
7   }
8 }
9 void rand_inserts_serial(HashTable* H, int n) {
10   for (int i = 0; i < n; i++) {
11     int res; Key k = rand();
12     do {                                       /* retry until the insert succeeds */
13       res = try_insert(H, k, k);
14     } while (res == FAILED);
15     resize_table_if_overflow(H);
16   }
17 }
Figure 5.1: Example Cilk function which performs n hash table inserts, potentially in parallel. After every
insert, the rand_inserts method checks whether the insert triggered an overflow and resizes the table if
necessary.
In Figure 5.1, the rand_inserts function uses spawn and sync to perform n insert operations
in parallel in a divide-and-conquer fashion.
The Cilk runtime executes a program on P workers, where P is specified at runtime. Con-
ceptually, every worker maintains a double-ended queue, or deque, which stores the work of the
processor. When a worker spawns a function f, it pushes the continuation of f onto the bottom
(tail) of its deque and continues working on f. When a worker completes its current work, it pops
work from the bottom of its deque. When a worker's deque becomes empty, however, it chooses
a victim worker uniformly at random and tries to steal work from the top (head) of the victim's
deque.
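As an illustration only, the deque discipline just described can be sketched in a few lines of Python. This is not the Cilk runtime: the class and function names (Worker, push_bottom, pop_bottom, steal_top, try_steal) are hypothetical, the sketch is single-threaded, and it ignores the synchronization a real runtime needs between an owner and a thief.

```python
import random
from collections import deque


class Worker:
    """One worker with a double-ended queue of pending work items."""

    def __init__(self, name):
        self.name = name
        self.dq = deque()  # left end = top (head), right end = bottom (tail)

    def push_bottom(self, item):
        # On a spawn, the continuation is pushed onto the bottom.
        self.dq.append(item)

    def pop_bottom(self):
        # The owner takes its own work from the bottom; None if empty.
        return self.dq.pop() if self.dq else None

    def steal_top(self):
        # A thief takes work from the top; None if the victim is empty.
        return self.dq.popleft() if self.dq else None


def try_steal(thief, workers, rng):
    """Pick a victim uniformly at random among the other workers."""
    victims = [w for w in workers if w is not thief]
    victim = rng.choice(victims)
    return victim.steal_top()


w0, w1 = Worker("w0"), Worker("w1")
for cont in ["c1", "c2", "c3"]:      # w0 spawns three times
    w0.push_bottom(cont)
own = w0.pop_bottom()                 # owner resumes the newest continuation
stolen = try_steal(w1, [w0, w1], random.Random(0))  # thief gets the oldest
```

The owner and the thief work at opposite ends of the deque, so a thief tends to take the oldest (and typically largest) piece of outstanding work.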
One can use a reader/writer lock to implement a resizable concurrent hash table in Cilk, as
shown in Figure 5.2. In this code, every insert operation acquires the table's reader lock, and a
resize operation acquires the table's writer lock. Thus, insert operations (on different buckets) are
allowed to run in parallel, but a table resize cannot execute in parallel with any insert.
Challenges for parallel regions
In Figure 5.2, one would like to be able to exploit parallelism within the resize operation, e.g., by
spawning the rehash_list operations for each bucket in Line 20. Unfortunately, several factors
prevent ordinary Cilk from efficiently exploiting this parallelism. First, if a worker p1 performing a
resize operation holds the table lock in writer mode, any other worker p2 which tries to concurrently
acquire the lock in reader mode will spin waiting for the resize to complete. Cilk currently has
no "suspend" mechanism which allows the worker to temporarily stop execution of one function
(insert) and begin stealing work from another function (resize).
1 int try_insert(HashTable* H, Key k, void* value) {
2   int success = 0, overflow = 0;
3   success = table_try_read_lock(H);            /* inserts hold the table lock in reader mode */
4   if (!success) { return FAILED; }
5   int idx = hashcode(H, k);
6   List* L = H->buckets[idx];
7   success = list_try_lock(L);                  /* per-bucket lock */
8   if (!success) { release_read_lock(H); return FAILED; }
9   list_insert(L, k, value);
10   list_unlock(L); release_read_lock(H);
11   return SUCCESS;
12 }
13 void resize_table_if_overflow(HashTable* H) {
14   if (is_overflow(H)) {
15     table_acquire_write_lock(H);               /* a resize excludes all inserts */
16     List** new_buckets;
17     int new_n = H->num_buckets*2;
18     new_buckets = create_buckets(new_n);
19     for (int i = 0; i < H->num_buckets; i++) {
20       rehash_list(H->buckets[i], new_buckets, new_n);
21     }
22     free_buckets(H->buckets);
23     H->buckets = new_buckets;
24     H->num_buckets = new_n;
25     release_write_lock(H);
26   }
27 }
Figure 5.2: Code for insert and resize for a concurrent hash table.
Second, even if one modified Cilk to implement a suspend mechanism, worker p2 is unlikely to
randomly steal work from a resize operation, since work corresponding to rehash_list is likely
to be at the bottom of worker p1's deque and steals occur at the top of the deque. Instead, p2 is
likely to steal work corresponding to a call to rand_inserts which will cause p2 to block again,
suspend, and try stealing again. Thus, this strategy has two drawbacks: it doesn't necessarily lead
to p2 being productive and it may cause p2 to use excessive stack space due to repeatedly blocking
and suspending tasks. The code in Figure 5.1 could use $\Omega(n)$ stack space, where n is the total
number of inserts, if a constant fraction of the inserts are spawned while a resize occurs.
Finally, it is difficult to implement a deadlock-free design which simultaneously allows workers
to suspend a blocked task and work-steal arbitrarily, and which supports nested locking. In particular,
to support the above strategy without deadlock in programs with nested locks, the
language must support continuations at all points where a function tries to acquire a lock. Without
such support, only the worker who suspends a task can resume it. This requirement can lead to
deadlocks (due to the scheduler), even if the program itself is deadlock-free, that is, even if the
programmer acquired locks according to a fixed partial order. For example, say a worker p acquires a lock
$\ell_1$ which protects the execution of a function F. Then, suppose p inside F fails to acquire a lock
$\ell_2$, suspends, and randomly steals some other function G. If G also tries to acquire lock $\ell_1$, we
have a deadlock, since only p can complete F and release $\ell_1$.
5.2 Design for Helper Locks
In this section, we present the parallel region lock (PRL) runtime, our design for supporting helper
locks in a Cilk-like work stealing runtime system. First, we describe helper locks, which allow a
program to express and exploit parallelism inside critical regions. We then provide an overview
of how we support parallel regions. Finally, we describe how PRL supports nested regions. We
present PRL in the context of MIT Cilk, the system we used to implement our prototype. The
design can be applied more generally, however, to other fork-join parallel languages which use a
work-stealing scheduler.
Helper locks
In order to support parallelism inside critical sections, we propose adding helper locks to MIT Cilk.
We modify the language to add two types of helper locks. A region helper lock, which protects a
large and parallel critical section, is specified using the construct spawn_region. A short helper
lock is like an ordinary lock and normally protects a short serial critical section, but it is linked
to a region helper lock. If a worker p fails to acquire a short helper lock $\ell_1$ which is linked to a
region helper lock $\ell_2$, then rather than spinning, it triggers the help_region construct for $\ell_2$.
This construct causes worker p to help complete the critical section currently protected by $\ell_2$, or
does nothing if $\ell_2$ is not currently held.
Figure 5.3 shows pseudocode which modifies Figure 5.2 to use helper locks to exploit parallelism
within the hash table resize. In Line 1, L is declared as the region helper lock for resize
operation H. If the insert in Line 3 fails because the insert cannot acquire the necessary bucket lock,
in Line 4, the current worker tries to help complete the region protected by L (if the resize lock L
is held). Conceptually, each bucket lock is a short helper lock, linked to L. Finally, in Line 7, the
worker uses the spawn_region construct to try to acquire lock L, and then spawn the function
resize_if_overflow as a parallel region protected by L. The lock L is specified as the first
argument to the region function. If spawn_region detects that L is already held, it implicitly
performs a help_region call for L before trying to acquire L again.
1 CilkRegionLock* L = H->resize_lock;
2 do {
3   res = try_insert(H, k, k);
4   if (res == FAILED) { help_region(L); }       /* help the resize instead of spinning */
5 } while (res == FAILED);
6 if (is_overflow(H)) {
7   spawn_region(resize_if_overflow)(L, H); sync;
8 }
Figure 5.3: Pseudocode for resizable hash table using a helper lock L. This code represents a modification
of the inner loop of the rand_inserts function (lines 12-15 of Figure 5.1).
Parallel regions
Conceptually, a parallel region is a subcomputation of a Cilk program, but with its own set of
deques used for scheduling. In the context of helper locks, a parallel region is an execution instance
of a critical region, protected by a region helper lock. Therefore, in the hash table example, each
call to the resize function creates a new parallel region, but all are protected by the same lock.
When the runtime system starts executing a parallel region A, it creates a deque pool for A,
called dqpool(A). At any point during its execution, a parallel region has certain workers assigned
to it, and all these workers have deques in A's deque pool. Initially, when a worker p
spawns a parallel region A, only p is assigned to the region A. As the program executes, more
workers may be assigned to A. When a worker p is assigned to a parallel region A, the runtime
system allocates a deque, dq(p, A), to p from dqpool(A). We say dq(p, A) is NULL if p is not
assigned to A. The runtime uses dqpool(A) for self-contained scheduling of region A on A's
assigned workers. While p is assigned to A, when p tries to steal work, it randomly steals only
from deques in dqpool(A).
In PRL, a worker can enter (be assigned to) a region A for three reasons. First, p1 is assigned
to A when p1 successfully spawns region A. Second, p1 may enter A when it calls help_region
on a lock $\ell$, and A holds lock $\ell$. Finally, a worker p1 can enter a region A because of random work
stealing: when p1 tries to steal from p2, and discovers p2 is assigned to A, p1 may also enter A.
Conceptually, workers may also leave a parallel region A before A completes. Allowing work-
ers to leave a region raises several issues, however. First, a worker p might repeatedly leave and
enter the same region A, repeatedly incurring the synchronization overhead of entering and leav-
ing A. In addition, we wish to maintain a property that work belonging to region A remains in
A's deque pool, since otherwise, it becomes difficult for workers assigned to A to find A's work.
Therefore, if a worker p is allowed to leave A while the deque q = dq(p, A) is not empty, then
p must abandon deque q, i.e., leave it without an assigned worker. If workers repeatedly abandon
deques, then it may be difficult to limit the space used to maintain deque pools. For these reasons,
in PRL, once a worker p is assigned to A, p remains in A until A finishes executing.
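The bookkeeping for a deque pool can be sketched as follows. This is a minimal single-threaded Python illustration under stated assumptions, not the PRL implementation: the Region class and its assign, dq, and steal methods are hypothetical names, workers are identified by integers, and None stands in for NULL.

```python
import random
from collections import deque


class Region:
    """A parallel region A together with its deque pool, dqpool(A)."""

    def __init__(self, name):
        self.name = name
        self.pool = {}  # worker id -> deque, i.e. dq(p, A)

    def assign(self, p):
        # Allocate dq(p, A) the first time worker p enters the region.
        return self.pool.setdefault(p, deque())

    def dq(self, p):
        # None plays the role of NULL when p is not assigned to A.
        return self.pool.get(p)

    def steal(self, thief, rng):
        # A thief assigned to A picks its victim only within dqpool(A).
        victims = [p for p in self.pool if p != thief]
        if not victims:
            return None
        q = self.pool[rng.choice(victims)]
        return q.popleft() if q else None  # steal from the top


A = Region("A")
A.assign(1).append("rehash bucket 0")   # worker 1 spawned the region
A.assign(1).append("rehash bucket 1")
A.assign(2)                             # worker 2 entered, e.g. via help_region
task = A.steal(2, random.Random(0))
```

Because a worker never leaves the region early in PRL, the sketch has no operation for removing a deque from the pool before the region completes.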
Nested regions
We would like PRL to support nested helper locks and parallel regions, i.e., a region A protected
by region lock $\ell_1$ should be able to spawn a region B protected by lock $\ell_2$. In addition, once
regions can be nested, the following scenario may occur. A worker p may enter a region A due to
a help_region call on lock $\ell_1$, and then enter region B due to a help_region call on lock $\ell_2$.
By maintaining a deque chain on every worker p, our design is able to support nesting of regions,
with arbitrary nesting depth.
With nested regions, every worker p may be assigned to many parallel regions, and thus have
many deques. In PRL, these deques of one processor form a chain, with each deque along the
chain belonging to the deque pool of a distinct region.¹ The top deque in every worker's chain
belongs to the global deque pool, which is the original set of deques for a normal Cilk program
context. The bottom deque in p's chain represents p's active deque, denoted activeDQ(p). We
let activeR(p) denote the region for the pool that activeDQ(p) belongs to. When p is working
normally, it changes only the tail of activeDQ(p). In addition, p always work-steals from deques
within the deque pool of activeR(p).
Whenever a worker p with active region A enters a region B, it adds a new deque for region B
to the bottom of its chain, i.e., it adds dq(p, B) as a child of dq(p, A), and sets activeDQ(p) to
dq(p, B). When B completes, every worker p assigned to B removes its deque $q_p$ = dq(p, B)
from the end of its deque chain, sets the parent of $q_p$ as its active deque, and starts working on the
region that $q_p$'s parent belongs to. Note that different workers assigned to B may return to different
regions.
Deque chains also help a worker p efficiently find deque pools for other regions during random
work-stealing. A worker p1 with an empty active deque randomly steals from other deques in the
pool of the same region (A = activeR(p1)). If p1 finds a deque q = dq(p2, A) which is also empty, but
which has a child deque q', then instead of failing the steal attempt, p1 enters the region
corresponding to q'. If q' is also empty but also has a child deque, etc., p1 can continue to enter the
regions for deques deeper in the chain.
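The chain-following rule can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the PRL runtime: the Deque and Worker classes and the steal_attempt function are hypothetical, regions are identified by name, the outermost pool is called "A", and a steal attempt that ends up entering regions returns no work (the thief steals inside its new active region on a later attempt).

```python
from collections import deque


class Deque:
    """dq(p, R): one worker's deque for region R, linked to its child deque."""

    def __init__(self, region, parent=None):
        self.region = region
        self.items = deque()   # left end = top, right end = bottom
        self.child = None
        if parent is not None:
            parent.child = self


class Worker:
    def __init__(self, name, outer="A"):
        self.name = name
        self.chain = [Deque(outer)]    # top of the chain: outermost pool

    def active_dq(self):
        return self.chain[-1]          # bottom of the chain

    def active_region(self):
        return self.active_dq().region

    def dq(self, region):
        # dq(p, region), or None if p is not assigned to that region.
        return next((q for q in self.chain if q.region == region), None)

    def enter(self, region):
        # Append dq(p, region) as a child of the current active deque.
        self.chain.append(Deque(region, parent=self.active_dq()))


def steal_attempt(thief, victim):
    """One steal attempt inside the thief's active region."""
    q = victim.dq(thief.active_region())
    if q is None:
        return None
    if q.items:
        return q.items.popleft()       # ordinary steal from the top
    # Empty deque: follow child links and enter the nested regions instead.
    while not q.items and q.child is not None:
        q = q.child
        thief.enter(q.region)
    return None


w1, w3 = Worker("worker 1"), Worker("worker 3")
w1.enter("B")
w1.enter("C")
w1.active_dq().items.append("task in C")

first = steal_attempt(w3, w1)    # w3 enters B and then C
second = steal_attempt(w3, w1)   # now w3 steals inside C
```

This mirrors the step in which a thief steals from a worker in the outer region and, finding only empty deques, enters the nested regions along that worker's chain.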
Figure 5.4 illustrates deque pools and deque chains for a simple computation. In this example,
worker 1 enters regions A through F (all regions are assumed to acquire different helper locks).
Initially, worker 1 spawns a region B, worker 4 randomly steals from 1 in dqpool(A), and spawns
region D, and worker 2 steals from 4 in A, and spawns region F. Next, worker 1 spawns a region
C nested inside B. Worker 3 then randomly work-steals from 1 in A, and enters B and C. Then,
worker 1 inside C makes a help_region call on the lock for D, enters D, and steals from worker
4 in D. Finally, worker 1 makes a help_region call on the lock for F and enters F.
Deadlock freedom
As with ordinary (non-helper) locks, an arbitrary combination of spawn_region and help_region
constructs with arbitrary locks can potentially introduce deadlock into a program. As with ordinary
locks, however, maintaining a partial order on region locks is sufficient for PRL to guarantee that a
program is deadlock-free.
¹ If a program deadlocks, the chain might contain multiple deques from the same region. One can potentially use
this fact to detect deadlock.
Figure 5.4: A computation dag with regions, and a snapshot of deque pools during execution.
Definition 1 For a program with helper locks, construct a graph G of locks, with an edge
from lock $\ell_1$ to lock $\ell_2$ if a worker in region $A_1$ protected by $\ell_1$ ever directly calls
spawn_region or help_region for region $A_2$ protected by $\ell_2$. We say that a program is
deadlock-free if the graph G it generates is always acyclic.
For programs with (region or short) helper locks which satisfy Definition 1, one can show that PRL
does not introduce deadlock due to scheduling. Given Definition 1, we know the regions along
every deque chain on every worker are consistent with the edges of G. Since random steals only
cause a worker p to enter regions which are deeper in some worker's chain, steals do not introduce
new edges into G. Thus, it is always possible for the worker with the "largest" activeR(p) to
complete its region.
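The acyclicity condition of Definition 1 can be checked mechanically for a lock graph recorded from one execution. The Python sketch below is illustrative only: the function name is_deadlock_free and the edge-list input format are assumptions, and a single recorded graph says nothing about other executions of the same program.

```python
def is_deadlock_free(edges):
    """Return True if the directed lock graph given by edges is acyclic.

    edges is a list of (l1, l2) pairs: a worker inside a region protected
    by l1 called spawn_region or help_region on a region protected by l2.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    WHITE, GREY, BLACK = 0, 1, 2       # unvisited, on the DFS stack, finished
    color = {lock: WHITE for lock in graph}

    def visit(lock):
        color[lock] = GREY
        for nxt in graph[lock]:
            if color[nxt] == GREY:     # back edge: a cycle of locks
                return False
            if color[nxt] == WHITE and not visit(nxt):
                return False
        color[lock] = BLACK
        return True

    return all(visit(lock) for lock in graph if color[lock] == WHITE)


ordered = is_deadlock_free([("l1", "l2"), ("l2", "l3"), ("l1", "l3")])
cyclic = is_deadlock_free([("l1", "l2"), ("l2", "l1")])
```

An acyclic graph corresponds to a partial order on region locks, which is the condition the text relies on.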
5.3 Completion Time and Space Usage
In this section, we prove bounds on completion time and space usage for a computation with helper
locks executed using PRL.
In order to state our results, we first define some terminology. We model the program as a
computation dag $\mathcal{E}$. Each node in $\mathcal{E}$ represents a unit-time task, and each edge
represents a dependence between tasks. As in [ABP98], we assume each node in $\mathcal{E}$ has degree at
most 2. We model regions as subdags of $\mathcal{E}$. The entire computation $\mathcal{E}$ itself is
considered a region, which encloses all other regions. Let regions($\mathcal{E}$) denote the set of all
regions enclosed within $\mathcal{E}$, and let $N = |\mathrm{regions}(\mathcal{E})|$ denote the total number of
regions in $\mathcal{E}$. For any node v, we say that a region
A contains v if v is a node in A's dag. We say that v belongs to region A if A is the innermost
region that contains v.
We assume that regions in $\mathcal{E}$ have a canonical structure which satisfies the following
assumptions. First, the (sub)dag for each region A has a unique initial node called root(A), and a unique
final node called final(A), such that all other nodes in the region are descendants of root(A)
and ancestors of final(A). Second, the regions are properly nested; if A contains one node from
B, then it contains all nodes from B. Third, for every region B which is directly nested inside a
parent region A, final(B) has outdegree 1, and the successor node of final(B) (call it v) has
indegree 1. Conceptually, v represents the resumption of A after B completes.
It is useful to prove results only for programs which are free from deadlock. We say that a
computation $\mathcal{E}$ is deadlock-free if it was generated by a program which satisfies Definition 1.
For a given region A, the (total) work of region A, denoted $T_1(A)$, is defined as the number
of nodes contained in A. We define the span of A, denoted $T_\infty(A)$, as the number of nodes along
the longest path from root(A) to final(A). The span represents the time it takes to execute a
region on an infinite number of processors, assuming all nested regions are eliminated and flattened
into the outer region (and ignoring any mutual exclusion requirements for locked regions). For the
full computation $\mathcal{E}$, we leave off the argument and say that the work is
$T_1 = T_1(\mathcal{E})$ and the span is $T_\infty = T_\infty(\mathcal{E})$.
It is also useful to define a work and span for a region A which considers only nodes belonging
to A. We define the region work of A, denoted $\tau_1(A)$, as the number of nodes in the dag which
belong to A. For any path h through the graph $\mathcal{E}$, the path length for region A of h is the
number of nodes u along path h which belong to A. We define the region span of a region A, denoted
$\tau_\infty(A)$, as the maximum path length for A over all paths h from root(A) to final(A).
Intuitively, the region span of A is the time to execute A on an infinite number of processors,
assuming A's nested regions complete instantaneously. As an example, in the computation dag in
Figure 5.4, region D has $T_1(D) = 13$, $\tau_1(D) = 7$, $T_\infty(D) = 8$, and $\tau_\infty(D) = 5$.
Finally, we define the aggregate region span:
Definition 2 For a computation $\mathcal{E}$ with parallel regions, define the aggregate region span
$\widetilde{T}_\infty$ as
$\widetilde{T}_\infty = \sum_{A \in \mathrm{regions}(\mathcal{E})} \tau_\infty(A)$.
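To make the four quantities concrete, the following Python sketch computes the work, span, region work, region span, and aggregate region span for a small dag. The dag itself, the node names, and all function names are hypothetical and chosen only for illustration; the sketch assumes the canonical structure above, so every path from root(A) stays inside A until it reaches final(A). The outer computation is labeled "E" and contains one nested region "B".

```python
from functools import lru_cache

# Successor lists of a small series-parallel dag (hypothetical example).
succ = {
    "e0": ["e1", "e4"], "e1": ["b1"], "b1": ["b2", "b3"],
    "b2": ["b4"], "b3": ["b4"], "b4": ["e2"],
    "e2": ["e3"], "e4": ["e3"], "e3": [],
}
# Innermost region of each node (the region the node "belongs to").
belongs = {n: ("B" if n.startswith("b") else "E") for n in succ}
parent = {"E": None, "B": "E"}          # region nesting
root = {"E": "e0", "B": "b1"}
final = {"E": "e3", "B": "b4"}


def contains(region, node):
    r = belongs[node]
    while r is not None:
        if r == region:
            return True
        r = parent[r]
    return False


def work(region):                        # total work of the region
    return sum(contains(region, n) for n in succ)


def region_work(region):                 # counts only nodes belonging to it
    return sum(belongs[n] == region for n in succ)


def longest(region, counted):
    """Max over paths root -> final of the number of counted nodes."""
    @lru_cache(maxsize=None)
    def best(n):
        here = 1 if counted(n) else 0
        if n == final[region]:
            return here
        return here + max(best(s) for s in succ[n])
    return best(root[region])


def span(region):                        # nested regions flattened in
    return longest(region, lambda n: True)


def region_span(region):                 # nested regions count as zero
    return longest(region, lambda n: belongs[n] == region)


aggregate_span = sum(region_span(r) for r in parent)
```

In this example the longest path through "E" has 7 nodes, of which 4 belong to "E" itself, so the span of "E" is 7 while its region span is 4; adding the region span 3 of "B" gives an aggregate region span of 7.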
To start with, we consider only computations without contention on short helper locks, i.e.,
we assume that no worker ever blocks due to a short lock. More precisely, we prove that PRL
completes a deadlock-free computation $\mathcal{E}$ with work $T_1$ and aggregate span
$\widetilde{T}_\infty$ on P processors in expected time $O(T_1/P + \widetilde{T}_\infty + PN)$. Moreover, for
any $\epsilon > 0$, the execution time is $O(T_1/P + \widetilde{T}_\infty + PN + \lg(1/\epsilon))$ with
probability at least $1 - \epsilon$. Later, we extend this bound to handle short
helper locks.
We also prove a bound on space usage. Generalizing the notation in [BL99], we define the
region stack depth of a node u and region A, denoted stackDepth(A, u), as the sum of the
sizes of all the frames of u's ancestors (including u itself) in the execution stack, considering only
frames belonging to region A. Thus, stackDepth(A, u) = 0 if u does not belong to region
A. For any region A and computation $\mathcal{E}$, we let S(A) be the maximum over all nodes
$u \in \mathcal{E}$ of stackDepth(A, u). Let $S_1(A)$ be the minimum over all possible 1-processor
executions of S(A). Finally, let $\widetilde{S}_1 = \sum_{A \in \mathrm{regions}(\mathcal{E})} S_1(A)$. We show
that for any deadlock-free computation $\mathcal{E}$, PRL executes $\mathcal{E}$ on P processors using at
most $O(P\widetilde{S}_1)$ space. This space bound is not affected by
the contention on short locks.
Execution
We use an execution model similar to the one described in [ABP98]. Let P denote the number
of processors (workers) being used to execute the computation. A node v is ready if all of v's
predecessors have already been executed. As in the original model of [ABP98], at any time step
when a processor p has work available, p has a currently assigned node u which is executed. If
executing u makes one other node v ready, then v becomes p's assigned node. If executing u makes
two nodes ready, then u represents a spawn, and one of u's successors is pushed onto p's deque
and the other successor is assigned. If u does not make any successor ready (this happens when
u's successor v is a sync node and its other predecessor is not done), then p removes the bottom
node w from its deque and assigns it. If p's deque is empty, then in the next step p becomes a thief,
chooses a victim deque uniformly at random, and attempts to steal work from this deque. If the
deque is not empty, p steals the top node x from it and x becomes p's assigned node.
For PRL, in order to support nested parallel regions, we extend this execution model with
regions by incorporating multiple deques and transitions of workers between regions. In the ex-
tended model, worker p's currently assigned node u always belongs to p's active region, A =
activeR(p). When p works on u, execution proceeds normally as in the original model, except
that when p's deque is empty, it steals work from the deques in A's deque pool. Regions also
introduce a few changes to the model:
1. Say p1 in region A steals from a deque q = dq(p2, A). If q is empty (or "blocked", as
described below), and q has a child deque dq(p2, B), then p1 enters B by creating a deque
dq(p1, B) and setting it as the active deque.
2. If u represents a help_region call on region B, we say that u is a helper node. Let v be
u's successor node; we call v the region successor of u. Then, v is said to be blocked on B.
This node v is considered blocked on region B until B completes.
3. If the assigned node u has a successor root(B) for another region B (directly nested inside
A) we define u's region successor v as the node v which is the successor of final(B), and
again we say that v is blocked on B. Note that by the canonical structure of $\mathcal{E}$, final(B) can
have only one successor node v which must belong to A. Therefore, this case is conceptually
similar to the previous case, except there is a region between u and v.
In our execution model, we conceptually place the blocked node v at the bottom of the deque
dq(p, A), even though it is not ready to be executed. No worker can steal a blocked node v however.
We say that v is invisible to other workers. If some worker p' tries to steal from dq(p, A) and v is
at the top of the deque, then p' is unable to steal v. Instead, p' enters the region B that v is blocked
on. We call a deque dq(p, A) with a blocked node on top a blocked deque.
After blocking on a region B (due to either of the above reasons), p enters B by creating a
deque dq(p, B) in B's deque pool and making it the active deque. In our model, a worker p leaves
a region B only when B completes. When p leaves a region B, it deallocates dq(p, B), and sets
dq(p, B)'s parent deque (say q) as its active deque. If q has a blocked node u as its bottom node, it
removes this blocked node and assigns it.
Finally, we have to model synchronization costs. When a worker enters or leaves a region,
it has to synchronize with other workers in that region. We pessimistically assume that it has to
synchronize with all other workers. Therefore, we call a step on which any worker is entering or
leaving a region a waiting step. A step when no worker is entering or leaving a region is called a
running step.
In summary, from the description of the execution model, one can prove the invariants in
Lemma 5.1.
Lemma 5.1 Let $v_1, v_2, \ldots, v_k$ be the nodes on a deque dq(p, A), arranged from bottom to top. Let
$u_i$ be any predecessor of $v_i$ if $v_i$ is an unblocked node, or the region predecessor of $v_i$ if $v_i$ is a
blocked node. An execution of a computation with helper locks maintains the following invariants
for deques:
1. For all i, $u_i$ is unique.
2. For all unblocked nodes $v_i$, $u_i$ is a spawn node.
3. For all $i = 2, 3, \ldots, k$, $u_i$ is an ancestor of $u_{i-1}$ in $\mathcal{E}$.
4. If dq(p, A) is an active deque with assigned node w, then w is a descendant of $u_1$.
5. For all i, $v_i$ belongs to region A.
6. If $v_i$ is blocked, then i = 1, i.e., a blocked node must be at the bottom of a deque.
7. Any blocked node v which belongs to region A must be in an inactive deque
dq(p, A) $\neq$ activeDQ(p) for some worker p.
PROOF.
Invariants 1 and 2 hold because nodes v are added to a deque only when a worker p works on
an assigned node u representing a spawn, or when an assigned node u pushes its region successor
v as a blocked node onto the deque.
Invariant 5 holds because the execution model only assigns a node u to worker p if u belongs
to the active region activeR(p), and because a spawn node u can only push nodes belonging to
the same region.
One can show that Invariants 3 and 4 hold by induction on the actions of the execution model.
Invariants 6 and 7 hold because a blocked node is put on the deque dq(p, A) only when p enters
another region B, and entering a region always creates a new active deque. □
Potential function
We shall bound the completion time of a program with parallel regions by using a potential function
argument. In order to define the potential, we first define the depth and weight of nodes. For every
node u belonging to A, we define the depth of u, called d(u), as the maximum path length for
region A over all paths from root(A) to u. In addition, we define the weight of a node u belonging to
region A, denoted by w(u), as $w(u) = \tau_\infty(A) - d(u)$.
As with the analysis in [ABP98], one can show that the weights of nodes along any deque
strictly decrease from top to bottom.
² In [ABP98], depth and weight are defined in terms of an execution-dependent enabling tree. We use this simpler
definition because we are only considering series-parallel computations $\mathcal{E}$.
Lemma 5.2 Let $v_1, v_2, \ldots, v_k$ be the nodes in any deque dq(p, A) ordered from bottom of the
deque to the top. In addition, let $v_0$ be the assigned node if dq(p, A) = activeDQ(p). Then, we
have $w(v_0) \le w(v_1) < \cdots < w(v_k)$.
PROOF. If the deque is not empty, then either $v_1$ is a blocked node, or an assigned node $v_0$ exists.
Invariants 6 and 7 from Lemma 5.1 show that these two conditions are mutually exclusive. Define
the $u_i$'s as in Lemma 5.1.
In the first case, suppose $v_1$ is a blocked node and $v_0$ does not exist. Since we assume $\mathcal{E}$ is
a series-parallel dag, the depth of any spawn node u is always 1 less than the depth of its two
children. By Invariant 2, for all the unblocked nodes $v_2, v_3, \ldots, v_k$ in the deque,
$d(u_i) = d(v_i) - 1$. Similarly, we know that for a blocked node $v_1$, the region predecessor $u_1$
satisfies $d(u_1) = d(v_1) - 1$. By Invariant 3, we know that $u_i$ is an ancestor of $u_{i-1}$ in
$\mathcal{E}$; thus $d(v_{i-1}) > d(v_i)$ for all $i = 2, 3, \ldots, k$. Converting from depth to weight, we get
$w(v_1) < w(v_2) < \cdots < w(v_k)$.
In the second case, suppose $v_0$ does exist (and thus, the deque in question contains only
unblocked nodes). Applying the same logic to the nodes on the deque as in the first case, we have
$d(v_{i-1}) > d(v_i)$ for $i = 2, 3, \ldots, k$. By Invariant 4, we know $v_0$ is a descendant of $u_1$; thus,
$d(v_0) > d(u_1) = d(v_1) - 1$. Thus, $d(v_0) \ge d(v_1)$. Converting from depth to weight reverses the
inequalities, and we get $w(v_0) \le w(v_1) < \cdots < w(v_k)$. □
As in [ABP98], we use the weight to define a potential function for ready or blocked nodes u,
and for a region A. We define the potential of a node u belonging to region A, denoted by $\Phi(u)$,
as $\Phi(u) = 3^{2w(u)-1}$ if u is assigned and $\Phi(u) = 3^{2w(u)}$ if u is ready or blocked. The potential of an active
deque dq(p, A) = activeDQ(p) is defined as $\sum_{v \in dq(p,A)} \Phi(v) + \Phi(u)$, where u is p's assigned
node. The potential of an inactive deque dq(p, A) is defined as $\sum_{v \in dq(p,A)} \Phi(v)$. The potential of
a region, $\Phi(A)$, is defined as the sum of the potentials of all the deques in A's deque pool.
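The definitions above can be made concrete with a small sketch. It assumes the exponents follow [ABP98] (that is, 2w(u) - 1 for an assigned node and 2w(u) for a ready or blocked node). The names `Node`, `weight`, `node_potential`, and `deque_potential` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    depth: int   # d(u): longest path length from root(A) to u inside region A
    state: str   # "assigned", "ready", or "blocked"

def weight(node, region_span):
    # w(u) = T_inf(A) - d(u), where region_span stands for T_inf(A)
    return region_span - node.depth

def node_potential(node, region_span):
    # Assumed form: 3^(2w - 1) if assigned, 3^(2w) if ready or blocked
    w = weight(node, region_span)
    if node.state == "assigned":
        return 3 ** (2 * w - 1)
    return 3 ** (2 * w)

def deque_potential(deque_nodes, region_span, assigned=None):
    # Sum over the nodes on the deque; an active deque also counts
    # the potential of its worker's assigned node.
    total = sum(node_potential(v, region_span) for v in deque_nodes)
    if assigned is not None:
        total += node_potential(assigned, region_span)
    return total
```

For a region with span 3, a ready root at depth 0 has potential 3^6, which matches the starting potential of a region used in the lemmas that follow.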
Lemma 5.3 In the course of the computation, the potential $\Phi(A)$ of a particular region A increases from 0 to $3^{2T_\infty(A)}$ when root(A) becomes ready. At all other times, there is no potential increase.
PROOF.
The scheduler performs four actions that can change the potential. First, it may remove a node
u (belonging to A) from the deque and assign it. Second, it may execute an assigned node u, which
makes one or two child nodes from the same region A ready. In both these cases, only the potential
$\Phi(A)$ of region A is affected, and it always decreases by the argument given in [ABP98].
The third action is that a worker p may execute a node u from region A and enable (exactly
one) node w = root(B) from region B. In our execution model, the node v which is the region
successor of u blocks and is added to dq(p, A). Since v belongs to A, $\Phi(A)$ decreases because v is
a successor of u and has a lower potential. At the same time, dq(p, B) becomes a new active deque
for p and root(B) becomes ready, and $\Phi(B)$ increases from 0 to $3^{2T_\infty(B)}$.
The fourth and final action occurs when a worker p executes the final node of a region, final(B),
and enables a previously blocked node v belonging to A. In this case, the potential of B decreases
to 0. Since v is a blocked node, changing v from blocked to assigned results in a decrease in the
potential $\Phi(A)$. □
Bounding steal attempts
We say that a steal attempt occurred in region A if the thief's active deque is in A's deque pool. In
addition, we define a counted steal attempt as a steal attempt which occurs on a running step. To
prove a bound on completion time, we need to bound the total number of counted steal attempts
in all regions. We first bound the number of steal attempts in each region separately. Later, we
aggregate the steal attempts from different regions.
To bound the number of counted steal attempts which occur in a region A, we divide steal
attempts in A into phases. Intuitively, a phase $R_k(A)$ in region A consists of O(P) counted steal
attempts in region A. We classify phases as either "entering" or "contributing" phases. In our
analysis, we amortize contributing phases against the potential of region A, and amortize the entering
phases against the number of regions N.
We first define the boundaries of phases more precisely. The first phase $R_1(A)$ in region A
begins when root(A) is assigned to some worker. A phase $R_k(A)$ ends after at least P counted
steal attempts in A have occurred, or when some worker p executes final(A). For k > 1, we call a phase
$R_k(A)$ an entering phase if, at the beginning of the phase, one of the deques in A's deque pool
is either blocked, or empty and inactive. Otherwise, we call $R_k(A)$ a contributing phase. Note that
any phase can have at most 2P - 1 counted steal attempts, if P counted steals occur in the last time
step of the phase.
First, we bound the number of contributing phases for a region A. At the beginning of phase k
for A, let $E_k(A)$ be the sum of the potentials due to empty or blocked deques in A's deque pool,
and let $D_k(A)$ be the potential on all other deques. Therefore, the potential of A at the beginning
of phase k is $\Phi_k(A) = E_k(A) + D_k(A)$.
Lemma 5.4 If $R_k(A)$ is a contributing phase for A, then
$\Pr\{\Phi_k(A) - \Phi_{k+1}(A) \ge \Phi_k(A)/4\} > 1/4$.
PROOF. First, we bound the potential decrease in Dk(A). By definition, every phase (except
possibly the last phase) has at least P steal attempts. Thus we can apply Lemma 8 from [ABP98]
directly to claim that Dk(A) decreases by a factor of 1/4 with probability 1/4. If Rk(A) is a
contributing phase, then at the beginning of phase k, A has no blocked deques, and any empty
deques in A are active. For any empty and active deque q, the worker p assigned to q either had no
assigned node (and thus contributes nothing to Ek (A)), or had an assigned node u which executes
during the phase (since each phase contains at least one running time step), thereby reducing q's
contribution to $E_k(A)$ by more than 1/4. The lemma holds trivially for the last phase. □
Lemma 5.5 For a region A, the expected number of contributing phases is $O(T_\infty(A))$.
PROOF. The proof is analogous to Theorem 9 in [ABP98]. Call phase k successful if
$\Phi_k(A) - \Phi_{k+1}(A) \ge \Phi_k(A)/4$, i.e., the potential decreases by at least a 1/4 fraction. From Lemma 5.4, we know
$\Pr\{\Phi_{k+1}(A) \le 3\Phi_k(A)/4\} \ge 1/4$, i.e., a phase is successful with probability at least 1/4. The potential
for region A starts at $3^{2T_\infty(A)}$, ends at 0, and is always an integer. Thus, a region A can have at
most $8T_\infty(A)$ successful phases, and the expected number of phases needed to finish A is at most
$32T_\infty(A)$. □
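The constants 8 and 32 in this proof can be checked with a short calculation. The derivation below is a sketch that assumes the starting potential is $3^{2T_\infty(A)}$ and that each successful phase leaves at most a 3/4 fraction of the potential; $s$ is a symbol introduced here for the number of successful phases.

```latex
\begin{align*}
\Phi \text{ after } s \text{ successful phases}
  &\le \left(\tfrac{3}{4}\right)^{s} 3^{2T_\infty(A)}, \\
\left(\tfrac{3}{4}\right)^{s} 3^{2T_\infty(A)} < 1
  &\iff s > \frac{2\ln 3}{\ln(4/3)}\, T_\infty(A) \approx 7.64\, T_\infty(A).
\end{align*}
```

Because the potential is an integer, it must be 0 once it drops below 1, so $8T_\infty(A)$ successful phases suffice. If each phase succeeds with probability at least 1/4, the expected number of phases is at most $4 \cdot 8T_\infty(A) = 32T_\infty(A)$.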
We now bound the number of entering phases. Note that at the beginning of an entering phase
Rk(A) for region A, there exists some deque q = dq(p, A) which is empty-and-inactive or blocked
(and inactive). Let B be the region such that dq(p, B) is the child deque of q in the deque chain of
p. We consider Rk (A) a successful entering phase (for A) if during this phase, either B finishes,
or some active worker p' in A enters B.
Lemma 5.6 Any entering phase Rk (A) is successful with probability at least 1/2. In addition, the
expected number of entering phases in a computation is at most 4PN.
PROOF. Let q be any deque which is empty-and-inactive or blocked at the beginning of phase
k, and let B be the region such that dq(p, B) is the child deque of q. Until dq(p, B) finishes,
any worker who tries to steal from q will enter region B, and the phase is successful. Also,
if B finishes during the phase, then the phase is automatically successful. Thus, the phase can be
unsuccessful only if B does not finish, at least P steal attempts occur, and all fail to hit q. Since
each steal attempt hits q with probability 1/P, the probability that none of the steal attempts hit
q is at most $(1 - 1/P)^P < 1/e < 1/2$. Therefore, with probability at least 1/2, the entering phase is
successful.
Each worker can enter a region A at most once, since the scheduler does not allow workers to
leave A until A is finished. Therefore, the total number of successful entering phases which cause
a worker to enter any region is at most PN. Similarly, when a region B finishes, since B has at most
P deques, it can cause at most P successful entering phases for other regions. Therefore, the
total number of successful entering phases is at most 2PN. Since each phase is successful with
probability at least 1/2, the expected number of entering phases is at most 4PN.
□
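As a numerical sanity check of the two quantities used in this proof, the following sketch evaluates the miss probability $(1 - 1/P)^P$ and the resulting 4PN bound. The function names are hypothetical, and the model (P independent, uniformly random steal attempts aimed at one fixed deque) is the simplifying assumption stated in the proof.

```python
import math

def miss_probability(P):
    # Probability that P independent uniform steal attempts over P deques
    # all miss one fixed deque q.
    return (1.0 - 1.0 / P) ** P

def expected_entering_phases(P, N):
    # At most PN successful phases from workers entering regions, plus at
    # most PN from regions finishing; each phase succeeds with probability
    # at least 1/2, so the expectation is at most twice the success count.
    successful = 2 * P * N
    return 2 * successful
```

For every P from 1 to 1000 the miss probability stays below 1/e, which is the inequality the proof relies on.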
Using Lemmas 5.5 and 5.6, we can bound the total number of counted steal attempts for a
computation $\mathcal{E}$.
Lemma 5.7 Using PRL, the expected number of counted steal attempts in entering phases is at
most $O(P\widetilde{T}_\infty + P^2 N)$. In addition, the number of counted steal attempts is
$O(P\widetilde{T}_\infty + P^2 N + P\lg(1/\epsilon))$ with probability at least $1 - \epsilon$.
PROOF. Summing over all regions A, the expected number of phases is at most $32\widetilde{T}_\infty + 4PN$. Each phase
contains at most 2P - 1 counted steal attempts. Therefore, the expected number of counted steal
attempts is at most $64P\widetilde{T}_\infty + 8P^2 N$. For the high-probability bound, say the execution takes more
than $n = 32\widetilde{T}_\infty + 8PN + m$ contributing phases to A. Since each phase succeeds with probability
at least 1/4, the expected number of successes is at least $8\widetilde{T}_\infty + 4PN + m/4$. We can then use
Chernoff bounds as in [ABP98] to prove the result by choosing $m = 32\widetilde{T}_\infty + 16\ln(1/\epsilon)$. □
Completion time bound
We are now ready to prove Theorem 5.8, a bound on the completion time of a computation $\mathcal{E}$.
Theorem 5.8 The parallel regions work-stealing scheduler completes a deadlock-free computation
$\mathcal{E}$ with work $T_1$ and aggregate span $\widetilde{T}_\infty$ on P processors in expected time $O(T_1/P + \widetilde{T}_\infty + PN)$.
Moreover, for any $\epsilon > 0$, the execution time is $O(T_1/P + \widetilde{T}_\infty + PN + \lg(1/\epsilon))$ with probability at
least $1 - \epsilon$.
PROOF. We bound running and waiting steps separately. On running steps, no worker is entering
or leaving a region.
On a running time step, a worker p can take one of two actions. First, p can be executing a node
v which is ready. Second, p can be trying to steal from a deque in the pool of its active region,
activeR(p). In the first case, the total number of steps that workers can spend executing ready
nodes is $T_1$. In the second case, by Lemma 5.4 and Lemma 5.7, the expected number of steal
attempts is $O(P\widetilde{T}_\infty + P^2 N)$. Since there are P workers active on every step, the expected number of
running steps is $O(T_1/P + \widetilde{T}_\infty + PN)$. The high-probability bound follows similarly.
To bound the number of waiting steps, we notice that a worker can enter and leave a region A at
most once. Thus, there are at most O(PN) waiting steps, even if each worker enters every region.
□
Discussion of bounds
To understand Theorem 5.8, we can compare it to the completion time bound for a computation
without regions. The ordinary bound for randomized work stealing [BL99] says that the expected
completion time is $O(T_1/P + T_\infty)$. In Theorem 5.8, there are two differences.
First, there is an additive term of PN. If the number of parallel regions is small, then this
term is insignificant compared to the other terms in the bound. Since parallel regions are meant to
represent large critical sections, we expect N to be small in most programs. For example, in the
hash table example from Section 5.1, if we perform n inserts, we expect to have only O(lg n)
parallel regions for resizes. Furthermore, even if there are a large number of parallel regions, if
each parallel region A is sufficiently big ($T_1(A) = \Omega(P^2)$) or long ($T_\infty(A) = \Omega(P)$), then the term
$PN = O(T_1/P + \widetilde{T}_\infty)$, i.e., the PN term is asymptotically absorbed by the other terms. In general,
we expect small critical sections to be protected using short locks, not region locks.3
Second, Theorem 5.8 has the term $\widetilde{T}_\infty$ instead of $T_\infty$. One can understand both the
work-stealing bound and the parallel regions bound in terms of parallelism. The ordinary work-stealing
bound means that the program gets linear speedup if $P = O(T_1/T_\infty)$. That is, the program gets
linear speedup if the parallelism of the program is at least $\Omega(P)$. We can restate the PRL bound as
follows:
$$\frac{T_1}{P} + \widetilde{T}_\infty + PN = O\left(\sum_{A \in \mathrm{regions}(\mathcal{E})} \left(\frac{T_1(A)}{P} + T_\infty(A)\right) + PN\right)$$
Using this restated bound, one can see that (ignoring the PN term) PRL provides linear speedup if
the parallelism of each region is at least $\Omega(P)$. In addition, if the number of parallel regions is small
(as we expect), then the term $\widetilde{T}_\infty$ is generally small as well. In the hash table example, a region A
which can resize an array of length k completely in parallel would ideally have $T_\infty(A) = O(\lg k)$.
Then, for n inserts, $\widetilde{T}_\infty = O(\lg^2 n)$, whereas the total work is $\Theta(n)$.
3 The PN bound is overly pessimistic for small parallel regions, since it assumes that all processors enter all critical
sections. In fact, it is very unlikely that all processors will enter a small parallel region, and a small critical region will
generally execute sequentially, as if it were protected by a short lock.
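The hash table estimate in the discussion above (O(lg n) resize regions and an aggregate span of O(lg^2 n)) can be illustrated numerically. The sketch below assumes a table that doubles from an initial size until it reaches n, and that a resize of a table with k buckets has span about lg k. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math

def resize_sizes(n, initial=1):
    # Table sizes at which a doubling resize is triggered before
    # the table reaches n buckets.
    sizes = []
    k = initial
    while k < n:
        sizes.append(k)
        k *= 2
    return sizes

def aggregate_span(n, initial=1):
    # Sum of the assumed per-region spans, ceil(lg k), with a minimum of
    # one step per region.
    return sum(max(1, math.ceil(math.log2(k))) for k in resize_sizes(n, initial))
```

For n = 2^20 this gives 20 resize regions and an aggregate span of 191, well below lg^2 n = 400 and tiny compared with the linear total work.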
Time bound with short helper locks
We now add the contention on short helper locks into the analysis. In general, a short helper lock
functions like a normal lock, and it is difficult to get interesting contention bounds on computa-
tions with locks, even in the absence of parallel regions. Here we present a simple bound just for
completeness.
We define the bondage b of an execution graph $\mathcal{E}$ as the total number of nodes enclosed
within short locks in $\mathcal{E}$. The bondage represents the amount of computation enclosed within short
locks. Since short helper locks are designed to protect small critical sections, we assume that
spawn_region or help_region calls never occur inside these critical sections. In our
execution model, when a worker p holds a short lock $\ell$, any other worker which tries to acquire $\ell$ waits
for p to release $\ell$ if no parallel region holds the region lock linked to $\ell$. We assume worst-case
contention and say that whenever a short lock $\ell$ is held by worker p, all other P - 1 workers are
waiting on $\ell$. Thus, we consider any time step when any worker is waiting on a lock to be a waiting
step (as we did earlier when a worker was entering or leaving a region). Then, we can bound the
number of waiting steps due to short locks as follows:
Lemma 5.9 For a deadlock-free computation, the total number of waiting steps due to short locks
is at most b.
PROOF. In any waiting step due to short locks, at least one worker is holding a short lock and is
therefore executing some node that is enclosed within a short helper lock. Therefore, the remaining
bondage (the number of unexecuted nodes enclosed in short helper locks) decreases by at least 1
after each waiting step due to short locks. There can be at most b such waiting steps. □
One can bound the completion time for a computation with both region and short helper locks
by adding b to the bound in Theorem 5.8. Even though this bound seems simple, there exist
computations for which this bound is asymptotically optimal. For example, if all parallel regions are
protected by the same lock $\ell_1$ and all short lock acquires are for the same lock $\ell_2$, then asymptotically,
one cannot do better than the bound we give (ignoring the PN term).
Space
Now we prove Theorem 5.10, which bounds the stack space.
Theorem 5.10 Let $\widetilde{S}_1 = \sum_{A \in \mathrm{regions}(\mathcal{E})} S_1(A)$. For a deadlock-free computation $\mathcal{E}$, PRL executes
$\mathcal{E}$ on P processors using at most $O(P\widetilde{S}_1)$ space.
PROOF. At any fixed point in time, consider the tree T of active frames for the entire computation
$\mathcal{E}$. For any region A, let $T_A$ be the subset of T which consists of only frames which belong to A.
The only time a worker p stops working on its current active frame f in region A without
completing f is if p enters a new region B. Also, a worker p can only enter a region A once. Using
these two facts, one can show that the set $T_A$ is in fact a tree with at most k leaves, where k is the
number of workers which have entered A. This fact is a generalization of the busy-leaves property
from [BL99], applied to only nodes of a specific region A. Thus, A uses at most $O(kS_1(A))$ stack
space for $T_A$. In the worst case, all P workers have entered every region $A \in \mathrm{regions}(\mathcal{E})$, and
we must sum the space over all regions, giving us a space usage of $O(P\widetilde{S}_1)$, where
$\widetilde{S}_1 = \sum_{A \in \mathrm{regions}(\mathcal{E})} S_1(A)$. □
Comparison with alternative implementations
We now compare the bounds of our implementation of parallel regions with two alternatives.
The first option does not allow helping. When a worker p blocks on a lock $\ell$, it just spins or waits
for $\ell$ to become available. Using this traditional implementation for locks, the completion time of
a program with critical sections (either expressed as parallel regions or just expressed sequentially)
can be $\Omega(T_1/P + \sum_{A \in \mathrm{regions}(\mathcal{E})} T_1(A) + b)$. Therefore, if the program has large (and highly parallel)
critical sections (as in the hash table example), then this implementation may run significantly
slower than when one uses helper locks.
Second, we can compare against an alternative we describe in Section 5.1, where if a worker p
blocks on a lock, it suspends its current work, and then work steals normally. The implementation of
this design may be easier than the one we propose in this chapter, since it does not need deque pools
and chains of deques. Each worker maintains just one deque. As we mentioned in Section 5.1, this
implementation may deadlock in the scheduler unless all region and short locks have continuations.
However, even if each lock has continuations, this implementation still has the disadvantage that it
may use more space than our design. In fact, even for programs with just one parallel region (and
many short locks), this implementation may use $\Omega(T_1)$ space. In contrast, for one parallel region,
our implementation uses $O(PT_\infty)$ space, which is much smaller than $T_1$ for reasonable parallel
programs.
5.4 Prototype Implementation
This section describes our implementation of parallel regions and helper locks. We based our
prototype on MIT Cilk, since it is readily available as open source [Sup03]. First, we review the
existing implementation of deques in Cilk [FLR98]. Then, we discuss deque chains and deque
pools, the two major additions to the runtime system needed to support parallel regions.
Cilk deques
Each Cilk deque is represented by pointers into a shadow stack, a per-worker stack which stores
frames corresponding to Cilk functions. For example, for the code in Figure 5.1, when a worker
executes a call to rand_inserts(H, 63) and executes the spawn in Line 4, it begins
executing the first call to rand_inserts(H, 31), and pushes a frame for the continuation of
rand_inserts(H, 63) onto the shadow stack, marked so that some other worker can later
steal this frame and resume execution at Line 5.
Each deque consists of three pointers that point to slots in the shadow stack: a tail pointer T, a
head pointer H, and an exception pointer E. The runtime uses the THE protocol described in [FLR98]
to manage deques. T points to the first empty slot in the stack; when a worker pushes frames onto
or pops frames from its own deque, it modifies T. H points to the frame at the top of the deque; when other
workers steal from this deque, they remove the frame pointed to by H and decrement H. Finally,
the exception pointer E represents the point in the deque above which the worker should not pop;
if a worker working on the tail end encounters E > T, some exceptional condition has occurred,
and control returns to the Cilk runtime.
Figure 5.5: A chain of deques for a given worker p. Each deque dq(p, A) consists of pointers (tail, head,
and exception) into p's stack of Cilk frames. For clarity, we show a pointer for the base of each deque q;
in practice, q's base is always equal to the tail pointer of q's parent deque.
Deque chains
To implement parallel regions, we must conceptually maintain a chain of deques for each worker p.
Usually, p modifies only its bottom, active deque; however, other workers must be able to steal
from any deque along p's chain. For example, in Figure 5.4, worker 1's deque chain has 6 deques,
one for each region. Worker 2 can steal from dq(1, F), while worker 4 can steal from dq(1, E).
One straightforward implementation of deque chains is to allocate a separate shadow stack for
each deque. Such a scheme, however, might allocate many stacks (i.e., PN instead of P). Dy-
namically allocating stacks every time a worker enters a region is potentially expensive. Statically
allocating the correct number of shadow stacks is tricky because N might vary depending on an
execution.
Instead, PRL maintains the entire chain of deques for a given worker p on the same shadow
stack, as shown in Figure 5.5. Each deque for a given worker p maintains its own THE pointers,
but all point into the same shadow stack. When a worker enters a region, it only needs to create
the THE pointers for a new deque, and set all these pointers equal to the tail pointer for the
parent deque in the shadow stack. The correctness of this implementation relies on the property that
two deques in the same shadow stack (for a worker p) cannot grow and interfere with each other.
This property holds for two reasons. First, for a deque, the head H never grows upwards since
H only changes when steals remove frames. Second, only the tail pointer T of the active deque
activeDQ(p) grows downwards, and activeDQ(p) is the bottom deque on the deque chain.
Figure 5.5 shows an example arrangement of a deque chain on a worker's stack. In this example,
no worker has stolen any frames from region B, while two frames have been stolen from region C.
Worker p's deque for region E is empty, and activeR(p) = F.
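The shared-stack layout described above can be sketched as follows. This is a single-threaded illustration only, not the Cilk runtime: it omits the exception pointer and all synchronization, and it assumes list indices increase toward the tail, so a steal advances the head index (the opposite of the pointer direction in the text). The names `Deque`, `Worker`, `enter_region`, `leave_region`, and `steal_from` are hypothetical.

```python
class Deque:
    def __init__(self, region, base):
        self.region = region
        self.base = base   # equals the parent deque's tail at creation time
        self.head = base   # H: oldest frame still on this deque (steal end)
        self.tail = base   # T: first empty slot (owner's end)

class Worker:
    def __init__(self):
        self.stack = []    # one shadow stack shared by every deque in the chain
        self.chain = []    # deque chain; the last entry is the active deque

    def enter_region(self, region):
        # A new deque starts at the tail of the current active deque.
        base = self.chain[-1].tail if self.chain else 0
        self.chain.append(Deque(region, base))

    def push(self, frame):
        active = self.chain[-1]   # only the active deque's tail ever grows
        self.stack.append(frame)
        active.tail += 1

    def pop(self):
        active = self.chain[-1]
        if active.tail == active.head:
            return None           # active deque is empty
        active.tail -= 1
        return self.stack.pop()

    def steal_from(self, index):
        # A thief may take the oldest frame of any deque in the chain.
        dq = self.chain[index]
        if dq.head == dq.tail:
            return None
        frame = self.stack[dq.head]
        self.stack[dq.head] = None  # slot is dead once stolen
        dq.head += 1
        return frame

    def leave_region(self):
        dq = self.chain.pop()
        assert dq.head == dq.tail, "deque must be empty when the region ends"
        del self.stack[dq.base:]    # the parent's tail is the stack top again
```

Because a new deque starts exactly at its parent's tail and only the active deque is ever pushed to, the slots of two deques in the chain never overlap, which is the property the text relies on.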
Deque pool implementation
In our prototype, we implement a deque pool as a single array of deques (i.e., THE pointers). An
array of P slots is statically allocated when the user creates a region lock. When a region is active,
one or more slots of this array are occupied by workers.
To enter the pool, workers try to reserve the next available slot i in the array by using a
compare-and-swap operation to increment a size field from i to i + 1. When a worker p tries a
spawn_region call specifying a region helper lock $\ell$, it succeeds in acquiring $\ell$ only if it acquires
slot 0 in the array. A worker which manages to reserve slot 0 is considered the master worker
thread for the region. A worker p that tries a help_region call succeeds in adding itself as a helper to
the region if it acquires a slot $i \ge 1$ in the array. When the region completes, each worker clears its
deque in the lock's deque pool and increments a counter to signal its departure. The master worker
thread waits until the last worker has left before it releases the lock (i.e., sets the size field to 0).
Work stealing occurs within a region by choosing a slot in the deque pool uniformly at random,
and trying to steal from that deque.
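The slot-reservation step can be sketched in Python. Python has no hardware compare-and-swap, so the sketch emulates it with a mutex; that emulation, the class name `DequePool`, and the method names are assumptions made for illustration. The departure counter and the release of the lock by the master are not modeled.

```python
import threading

class DequePool:
    def __init__(self, num_workers):
        self.slots = [None] * num_workers  # one deque slot per worker
        self.size = 0                      # number of reserved slots
        self._guard = threading.Lock()     # stands in for an atomic CAS

    def _cas_size(self, expected, new):
        # Emulated compare-and-swap on the size field.
        with self._guard:
            if self.size == expected:
                self.size = new
                return True
            return False

    def try_spawn_region(self, worker):
        # Acquiring the region lock means reserving slot 0 (the master slot).
        if self._cas_size(0, 1):
            self.slots[0] = worker
            return 0
        return None

    def try_help_region(self, worker):
        # A helper reserves the next free slot i >= 1 of an active region.
        while True:
            i = self.size
            if i == 0 or i >= len(self.slots):
                return None   # region not active, or pool already full
            if self._cas_size(i, i + 1):
                self.slots[i] = worker
                return i
```

A helper that loses the race on the size field simply rereads it and retries, which is the usual pattern for a compare-and-swap loop.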
Our current implementation uses a simple locking protocol which associates a lock with
every deque. A worker must lock a deque before trying to steal from it. When a worker p with
activeDQ(p) = dq(p, A) tries to enter another region B, it locks dq(p, A), creates dq(p, B) as
dq(p, A)'s child deque, locks dq(p, B), then releases the lock on dq(p, A), and finally releases
the lock on dq(p, B). Similarly, when a worker p leaves a region B, it first locks the parent of
its active deque, dq(p, A), then locks dq(p, B), deallocates dq(p, B), and then releases the lock on
dq(p, A).
5.5 Hash Table Benchmark
This section presents some experimental results for an implementation of a resizable hash table,
using our prototype of PRL. Although our hash table implementation is not optimized, we present
these results to show that an implementation of PRL is feasible.
Hash table implementation
Our hash table maintains an array of pointers for hash buckets, with each bucket implemented
as a linked list. The table supports search and an atomic insert_if_absent function. A search
or insert locks the appropriate bucket in the table. When an insert causes the chain in a bucket
to overflow beyond a certain threshold, it atomically increments a global counter for the table. An
insert triggers a resize operation when more than a constant fraction of the buckets have overflowed.
A resize operation sets a flag marking that a resize operation is in progress, and then acquires
all the bucket locks, scans the buckets to compute the current size of the table, doubles the size of
the table until the density of the table is below a certain threshold, allocates a new array for buckets,
and rehashes the elements from the old buckets into new buckets. We parallelized the lock acquires,
the size computation, and the inserts into the new table.
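The overflow counter and resize trigger described above can be sketched sequentially. The sketch below is not the benchmark's code: it has no locks or parallelism, and the parameter names and threshold values (`chain_limit`, `overflow_fraction`, and a density threshold of 1.0) are assumptions chosen only to make the example concrete.

```python
class ResizableTable:
    def __init__(self, num_buckets=4, chain_limit=2, overflow_fraction=0.5):
        self.buckets = [[] for _ in range(num_buckets)]
        self.chain_limit = chain_limit            # assumed overflow threshold
        self.overflow_fraction = overflow_fraction
        self.overflowed = 0                       # global overflow counter
        self.resizes = 0

    def insert_if_absent(self, key):
        bucket = self.buckets[key % len(self.buckets)]
        if key in bucket:
            return False
        bucket.append(key)
        # Count a bucket once, at the moment its chain first overflows.
        if len(bucket) == self.chain_limit + 1:
            self.overflowed += 1
        if self.overflowed > self.overflow_fraction * len(self.buckets):
            self._resize()
        return True

    def _resize(self):
        items = [k for b in self.buckets for k in b]
        size = len(self.buckets)
        # Double until the density drops below the assumed threshold of 1.0.
        while len(items) / size >= 1.0:
            size *= 2
        self.buckets = [[] for _ in range(size)]
        for k in items:
            self.buckets[k % size].append(k)
        self.overflowed = sum(1 for b in self.buckets if len(b) > self.chain_limit)
        self.resizes += 1
```

In the parallel version described in the text, the rehash loop and the size computation are the parts that would run as a parallel region under the resize lock.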
We implemented two flavors of the hash table; the first flavor performs the resize operation
serially, and the second spawns the resize operation as a parallel region, protected by a resize
region helper lock. Each bucket lock functions as a short helper lock linked to this resize lock; if
the resize lock is held, then an attempt to acquire a bucket lock causes workers to help resize.
Our benchmark performs n insert_if_absent operations on the hash table by spawning
20 functions, with each function performing n/20 inserts serially.4 Keys are chosen uniformly
at random, based on a deterministic seed chosen for each of the 20 functions. Since keys k are
random, the benchmark uses a simple hash function of k modulo the number of buckets in the hash
table.
4 The number 20 is chosen arbitrarily, to be a number larger than P.
[Figure 5.6 plot, titled "10 Million Random Insertions". Horizontal axis: P. Series: Serial Resize, Initial Size = 10 Million Buckets; Helper Locks, Initial Size = 10 Million Buckets; Helper Locks, Initial Size = 10 Buckets; Serial Resize, Initial Size = 10 Buckets.]
Figure 5.6: Results from an experiment inserting $n = 10^7$ random keys into a concurrent resizable hash
table. Speedup is normalized to the runtime of the hash table with the same initial size and serial resize for
P = 1. For tables with initial size of 10 and n buckets, the runtime was 8.66 s and 3.60 s, respectively.
Each data point represents the average of 5 runs with the same parameters. This experiment was run on a
two-socket, quad-core (3.16 GHz Intel Xeon X5460) machine with 8 GB RAM.
Experimental results
Figure 5.6 shows results from performing insertions into the resizable hash table. We ran two
versions of the experiment, one which contains no resize operations and one that does. In both
experiments, the number of inserts is $n = 10^7$.
In the first experiment, the table began with $10^7$ buckets; thus, with $10^7$ inserts, no resize
operations were triggered. In this experiment, the implementations with serial and parallel resize
were comparable, and both provided a speedup of about 3.5 on 8 processors. These results indicate
that the overhead of using a region lock in this application is relatively small. In the second
experiment, the table began with 10 buckets, and the table size repeatedly doubled on resizes. In this
case, the implementation which uses a parallel resize operation with region locks provided a speedup
of about 3. In contrast, the implementation that used the serial resize provided a speedup of at most
2. This experiment indicates that there is some potential advantage to using region locks.
Note that in Figure 5.6, the plots for the two experiments (with and without resize) are not
directly comparable to each other. The serial table which did not resize ran about 2.4 times faster
than the serial hash table which does resize. This additional factor is approximately consistent with
the amortized cost of table doubling. Conceptually, every insert in a resizable table pays 3 times
the cost of a normal insert: once for the original insert, once to move it when the table is expanded,
and once to move another item which has already been moved once.
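The factor-of-3 accounting for table doubling can be checked by counting unit operations. The sketch below assumes a simplified model in which the table doubles exactly when it is full and each insert or move costs one unit; the benchmark's overflow-based trigger differs, so this only illustrates the amortized argument. The function name is hypothetical.

```python
def doubling_cost(n, initial=1):
    # Total unit operations for n inserts into a table that doubles when full:
    # one unit per insert plus one unit per element moved during a resize.
    capacity, size, moves = initial, 0, 0
    for _ in range(n):
        if size == capacity:
            moves += size      # rehash every element into the larger table
            capacity *= 2
        size += 1
    return n + moves
```

Under this model the total never exceeds 3n, which is the same order as the measured slowdown of about 2.4 between the resizing and non-resizing serial tables.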
Our current hash table implementation does not appear to scale beyond 4 processors, even when
no resizes occur. Thus, additional work is needed to improve scalability in this benchmark. In
practice, a program will hopefully have a more realistic and scalable mix of concurrent operations
(i.e., a combination of searches and inserts). Our primary goal for the benchmark, however, was to
test and evaluate the feasibility of our design; thus, we tried to trigger resize operations as often as
possible.
5.6 Related Work
In this section, we briefly review some related work.
OpenMP uses a parallel construct to support nested parallelism [Boa08]. Our implemen-
tation has some similarities with the implementation of this construct, although the design goal
is different. Like in OpenMP, every parallel region in our implementation has a master (worker)
thread, which is the first to enter the region, and which is guaranteed to resume execution after the
region completes.5 In OpenMP, the number of workers for a region is fixed when the region begins.
In PRL, however, additional workers can enter the region, either through random work stealing, or
because they are blocked on the lock for the region.
Cooperative techniques, where one thread helps another thread complete its work, have previ-
ously been proposed in a variety of contexts. In the context of nonblocking algorithms, researchers
[Bar93, TSP92, IR94] describe algorithms where threads cooperate to complete an operation when
they would otherwise block for synchronization. In the area of databases, Lim, Ahn, and Kim
describe [LAK03] a concurrent B-link tree algorithm which uses cooperative locking to handle nodes
with concurrent underflow.
5We made this decision for convenience in implementation. Our PRL design does not require a master thread.
Chapter 6
Memory Models for Transactions
This chapter describes a framework for defining and exploring memory semantics of
transactional memory (TM) systems and mechanisms. This framework, which is inspired by the
computation-centric framework proposed by Frigo [Fri98, FL98], allows TM semantics to be specified in an
implementation-independent way. We use this framework both to define existing TM memory
models, such as serializability, and to explore new memory models for TM, namely race freedom
and prefix-race freedom. We also use this framework to prove properties of
these memory models.
In subsequent chapters (Chapters 7, 8, and 9), we continue to use this model to both describe
new TM designs and to prove semantic properties of these designs. For example, in Chapter 7,
we use the framework to describe the operational semantics of open-nesting and to prove that it
provides the memory model we call prefix-race freedom. Similarly, in Chapter 8 we use the frame-
work to prove that ownership-aware transactions provide the memory model we call serializability
by modules.
This chapter is organized as follows: Section 6.1 provides some background on transactional
memory and nesting. Section 6.2 provides our framework for describing TM semantics. Sec-
tion 6.3 defines sequential consistency using our framework and then generalizes this definition in
the transactional setting. Section 6.4 formally defines the memory models of serializability, race
freedom, and prefix-race freedom. In Section 6.5 we prove that all three memory models are equiv-
alent for computations with only committed transactions, but are distinct when we model aborted
transactions or have open transactions. Section 6.6 explores some related work on semantics of
TM.
6.1 Background and Motivation
Atomic transactions represent a well-known and useful abstraction for programmers writing paral-
lel code. Database systems have utilized transactions for decades [GR93]. Typically, serializability
[Pap79] is used as a correctness condition for transactions. Under the memory model of serializability,
transactions affect global memory as if they were executed one at a time in some order, even if in
reality, several were executed concurrently. Transactional memory with either flat or closed nesting still
guarantees serializability.
This work was done in collaboration with Charles E. Leiserson and Jim Sukha [ALS06].
Open nesting provides a loophole in the strict guarantee of transaction serializability by allow-
ing an outer transaction to "ignore" the operations of its open subtransactions. Moss [Mos06] de-
scribes open nesting as a high-level methodology that incorporates an open-nested commit mecha-
nism, where the commit of a nested transaction I globally commits the memory locations accessed
by it (as described in Section 1.2). In addition to the behavior of the commit mechanism, the
methodology of open nesting requires high-level constructs such as abstract locks and "compensat-
ing actions" [Mos06]. For example, if a nested transaction I (nested inside transaction A) commits,
and A later aborts, a compensating action for I may be executed in order to undo the changes made
by I. Therefore, generally, open nesting can be viewed at two levels. At the memory level, open
nesting is described by the open-nested commit mechanism. At the program level, it consists of
other mechanisms that are involved in the methodology.
Indeed, even TM without any nesting can be viewed at two levels of abstraction. For example,
the hardware may implement rollback of memory state, but rely on the programmer or compiler to
retry transactions that abort, sometimes using backoff protocols to ensure that a given transaction
eventually commits. Thus, it is helpful to distinguish the memory model for TM, as the essential
memory semantics that the hardware (or the basic software) implements, from the program model,
as the semantics that the programmer sees.
In this chapter, our focus will be on memory models for TM. We shall not concern ourselves
with retry mechanisms, compensating transactions, and the like. A TM system should have well-
specified behavior even as a target for compilation, when all program-level support for transactions
and nesting is put aside. Low-level software may build upon the memory model to provide a
higher level of abstraction, e.g., for open nesting, but the semantics of open nesting must be un-
derstood by the programmers of this low-level software. Moreover, although one may ignore the
semantics of aborted transactions at the program-model level, at the level of the memory model,
even aborted transactions must have a reasonable semantics, at least up to the point where they
abort. Thus, we shall be interested in defining memory semantics even for aborted transactions.
This chapter describes a framework for defining transactional memory models. Our frame-
work, which is inspired by the computation-centric framework proposed by Frigo [Fri98, FL98],
allows TM semantics to be specified in an implementation-independent way. Within this frame-
work, we define the traditional model of serializability and two new transactional memory models,
race freedom and prefix-race freedom. We prove that these three memory models are equivalent for
computations that contain only closed transactions, as long as aborted transactions are "ignored."
For systems that support open nesting, however, the three models are distinct. We will use these
three models in subsequent chapters to understand semantics of TM implementations.
6.2 Transactional Computation Tree Framework
This section defines our framework for modeling transactional computations. Our model is inspired
by Frigo's computation-centric modeling of a program execution as a computation dag (directed
acyclic graph) [Fri98] with an "observer function" which essentially tells what write operation is
"seen" by a read. Our model uses a "computation tree" to model both the computation dag and
the nesting structure of transactions. We first define computation trees without transactions, then
we show how transactions can be specified, and finally, we define Lamport's classical sequential-
consistency model [Lam79].
Our computation-centric model focuses on an a posteriori analysis of a program execution.
After a program completes, we assume the execution has generated a trace which is abstractly
modeled as a pair $(C, \Phi)$, where $C$ is a "computation tree" describing the memory operations
performed and transactions executed, and $\Phi$ is an "observer function" describing the behavior of
read and write operations. We shall define $C$ and $\Phi$ more precisely below. We define $U$ to be the
set of all possible traces $(C, \Phi)$.
Within this framework, we define a memory model as follows:
Definition 3 A memory model is a subset $A \subseteq U$.
That is, A represents all executions that "obey" the memory model.
Computation trees without transactions
The computation tree C summarizes the information about the control structure of a program to-
gether with the structure of nested transactions. We first describe how a computation tree models
the structure of a program execution in the special case where the computation has no transactions.
Structurally, a computation tree $C$ is an ordered tree with two types of nodes: memory-operation
nodes $\text{memOps}(C)$ at the leaves, and control nodes $\text{spNodes}(C)$ as internal nodes. Let
$\text{nodes}(C) = \text{memOps}(C) \cup \text{spNodes}(C)$ denote the set of all nodes of $C$.
We define $M$ to be the set of all memory locations. Each leaf node $u \in \text{memOps}(C)$ represents
a single memory operation on a memory location $\ell \in M$. We say that node $u$ satisfies the read
predicate $R(u, \ell)$ if $u$ reads from location $\ell$. Similarly, $u$ satisfies the write predicate $W(u, \ell)$ if $u$
writes to $\ell$.
The internal nodes spNodes(C) of C represent the parallel control structure of the computation.
In the manner of [FL97], each internal node $X \in \text{spNodes}(C)$ is labeled as either an S-node or
a P-node to capture fork/join parallelism. All the children of an S-node are executed in series from
left to right, while the children of a P-node can be executed in parallel.
Several structural notations will help. Denote the root of a computation tree $C$ as $\text{root}(C)$.
For any internal node $X \in \text{spNodes}(C)$, let $\text{children}(X)$ denote the ordered set of $X$'s children.
For any tree node $X \in \text{nodes}(C)$, let $\text{ances}(X)$ denote the set of all ancestors of $X$ in $C$, and
let $\text{desc}(X)$ denote the set of all $X$'s descendants. Denote the set of proper ancestors (and
descendants) of $X$ by $\text{pAnces}(X)$ (and $\text{pDesc}(X)$). Denote the least common ancestor of two nodes
$X_1, X_2 \in C$ by $\text{LCA}(X_1, X_2)$.
Since every subtree of a computation tree is also a computation tree, we shall sometimes overload
notation and use a subtree and its root interchangeably. For example, if $X = \text{root}(C)$, then
$\text{memOps}(X)$ refers to all the leaf nodes in $C$, and $\text{children}(C)$ refers to the children of $X$.
Computation dags
A computation tree $C$ defines a computation dag $G(C) = (V(C), E(C))$ constructed as follows
and illustrated in Figure 6.1. For every internal node $X \in \text{spNodes}(C)$, we create and place two
corresponding vertices, $\text{begin}(X)$ and $\text{end}(X)$, in $V(C)$. For every leaf node $x \in \text{memOps}(C)$, we
place the single node $x$ in $V(C)$. For convenience, for all $x \in \text{memOps}(C)$, we define
$\text{begin}(x) = \text{end}(x) = x$.
Formally, the vertices of the graph V(C) are defined as follows:
$$V(C) = \text{memOps}(C) \cup \bigcup_{X \in \text{spNodes}(C)} \{\text{begin}(X), \text{end}(X)\}.$$
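For any computation tree rooted at node $X$, we define the edges $E(X)$ for the graph $G(X)$ recursively:
Base case: If $X \in \text{memOps}(C)$, then define $E(X) = \emptyset$.
Inductive case: If $X \in \text{spNodes}(C)$, let $\text{children}(X) = \{Y_1, Y_2, \ldots, Y_k\}$. If $X$ is an S-node,
then
$$E(X) = \{(\text{begin}(X), \text{begin}(Y_1)), (\text{end}(Y_k), \text{end}(X))\} \cup \left( \bigcup_{i=1}^{k-1} \{(\text{end}(Y_i), \text{begin}(Y_{i+1}))\} \right) \cup \left( \bigcup_{i=1}^{k} E(Y_i) \right).$$
If $X$ is a P-node, then
$$E(X) = \left( \bigcup_{i=1}^{k} E(Y_i) \right) \cup \left( \bigcup_{i=1}^{k} \{(\text{begin}(X), \text{begin}(Y_i)), (\text{end}(Y_i), \text{end}(X))\} \right).$$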
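The recursive edge definition can be made concrete with a short sketch. The `Node` class, the string encoding of `begin(X)`/`end(X)`, and the sample tree are illustrative assumptions, not part of the framework; the sketch also assumes every internal node has at least one child.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Node:
    name: str
    kind: str = "leaf"            # "S", "P", or "leaf" (a memory operation)
    children: List["Node"] = field(default_factory=list)

def begin(x: Node) -> str:
    # For a leaf, begin(x) = end(x) = x.
    return x.name if x.kind == "leaf" else f"begin({x.name})"

def end(x: Node) -> str:
    return x.name if x.kind == "leaf" else f"end({x.name})"

def edges(x: Node) -> Set[Tuple[str, str]]:
    """Edge set E(X) of the dag G(X), following the base and inductive cases."""
    if x.kind == "leaf":
        return set()                      # base case: no edges
    ys = x.children
    es: Set[Tuple[str, str]] = set()
    for y in ys:
        es |= edges(y)                    # union of the children's edge sets
    if x.kind == "S":
        es.add((begin(x), begin(ys[0])))  # enter the first child
        es.add((end(ys[-1]), end(x)))     # leave the last child
        for a, b in zip(ys, ys[1:]):
            es.add((end(a), begin(b)))    # chain consecutive children in series
    else:                                 # P-node: fork to and join from every child
        for y in ys:
            es.add((begin(x), begin(y)))
            es.add((end(y), end(x)))
    return es

# Hypothetical tree: a; (b || c); d
tree = Node("X1", "S", [Node("a"),
                        Node("X2", "P", [Node("b"), Node("c")]),
                        Node("d")])
dag_edges = edges(tree)
```

In this example the two parallel leaves `b` and `c` are connected only through `begin(X2)` and `end(X2)`, so no edge orders one before the other.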
We shall find it convenient to overload the LCA function, and define the least common ancestor
of two graph vertices $u, v \in V(C)$ as the LCA of the corresponding tree nodes.
The computation dag G(C) is a convenient way of representing the flow of the program exe-
cution specified by C. Unfortunately, our specification of computation dags via computation trees
limits the set of computation dags that can be described. In particular, computation trees can only
specify "series-parallel" dags [FL97]. We might have founded our framework for transactional-
memory semantics on more-general computational dags, but the added generality would not affect
any of our theorems, and it would have greatly complicated definitions and proofs.
We shall find it useful to define some graph notations. For a graph $G = (V, E)$ and vertices
$u, v \in V$, we write $u \preceq_G v$ if there exists a path from $u$ to $v$ in $G$, and we write $u \prec_G v$ if $u \neq v$
and $u \preceq_G v$. For any dag $G = (V, E)$, a topological sort $S$ of $G$ is an ordering of all the vertices of
$V$ such that for all $u, v \in V$, we have $u \prec_G v$ implies that $u <_S v$ ($u$ comes before $v$ in $S$). For a
dag $G$, we define $\text{topo}(G)$ as the set of all topological sorts of $G$.
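As a rough illustration of $\text{topo}(G)$, the sketch below enumerates all topological sorts of a small dag by brute force. The function names and the diamond-shaped example are assumptions made for illustration. Checking each edge is enough, because if every edge is respected then every path is respected as well.

```python
from itertools import permutations

def is_topological_sort(order, edges):
    """True if every edge (u, v) has u before v in `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)

def topo(vertices, edges):
    """All topological sorts of the dag (brute force; only for tiny graphs)."""
    return [p for p in permutations(sorted(vertices))
            if is_topological_sort(p, edges)]

# Hypothetical diamond dag: s -> a -> t and s -> b -> t
diamond = {("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")}
sorts = topo({"s", "a", "b", "t"}, diamond)
```

The diamond has exactly two topological sorts, one for each order of the parallel vertices `a` and `b`.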
Classical theories on serializability refer to a particular execution order for a program as a
history [Pap79]. In our framework, a history corresponds to a topological sort S of the computation
dag G(C). We define our models of TM using these sorts. Reordering a history to produce a serial
history is equivalent to choosing a different topological sort S' of G(C) which has all transactions
appearing contiguously, but which is still "consistent" with the observer function associated with
S.
Transactional computation trees
We can specify transactions in a computation tree $C$ by marking internal tree nodes. Marking a
node $T \in \text{spNodes}(C)$ as a transaction corresponds to defining a transaction $T$ that contains the
computation subdag $G(T)$, where $\text{begin}(T)$ is the start of the transaction and $\text{end}(T)$ is the end of
the transaction.¹ Formally, the computation tree $C$ specifies a set $\text{xactions}(C) \subseteq \text{spNodes}(C)$ of
internal nodes as transactions, and a set $\text{open}(C) \subseteq \text{xactions}(C)$ of open transactions. The set
of closed transactions is $\text{closed}(C) = \text{xactions}(C) - \text{open}(C)$. In Figure 6.1, nodes $T_1$ through
$T_8$ are transactions, and $X_1$ through $X_5$ are ordinary nodes. Define a transaction $T \in \text{xactions}(C)$
as nested inside another transaction $T' \in \text{xactions}(C)$ if $T' \in \text{ances}(T)$. Two transactions $T$ and
$T'$ are independent if neither is nested in the other.
Figure 6.1: A sample (a) computation tree $C$ and (b) the corresponding dag $G(C)$ for a computation that has
closed and open transactions. In this example, $T_2$ is open-nested inside $T_1$ and $T_8$ is open-nested inside $T_7$.
The $X$'s are tree nodes that are not marked as transactions. We have not specified whether each transaction
is committed or aborted.
Observer functions
Instead of specifying the value that a vertex $v \in \text{memOps}(C)$ reads from or writes to a memory
location $\ell \in M$, we follow Frigo's computation-centric framework [Fri98, FL98] which abstracts
away the values entirely. An observer function² $\Phi(v) : \text{memOps}(C) \to \text{memOps}(C) \cup \{\text{begin}(C)\}$
tells us which vertex $u \in \text{memOps}(C)$ writes the value of $\ell$ that $v$ sees. For a given computation
tree $C$, if $v \in \text{memOps}(C)$ accesses location $\ell \in M$, then a well-formed observer function must
satisfy $\neg(v \prec_{G(C)} \Phi(v))$ and $W(\Phi(v), \ell)$. In other words, $v$ cannot observe a value from a vertex
that comes after $v$ in the computation dag, and $v$ can only observe a vertex if it actually writes to
location $\ell$. To define $\Phi$ on all vertices that access memory locations, we assume that the vertex
$\text{begin}(C)$ writes initial values to all of memory.
Together, a computation tree $C$ and an observer function $\Phi$ defined on $\text{memOps}(C)$ specify a
trace.
6.3 Transactional Sequential Consistency
This section uses the computation tree framework to describe sequential consistency and then gen-
eralizes the definition to transactional programs. First, we use our framework to define Lamport's
classic model of sequential consistency [Lam79] in our transactional model. We then define some
notation in order to define transactional sequential consistency. Transactional sequential consis-
tency is the basis of our definition for more interesting transactional memory models.
Sequential consistency without transactions
We now follow Frigo [Fri98] in defining a "last-writer" observer function.
Definition 4 Consider a trace $(C, \Phi)$ with no transactions and a topological sort $S \in \text{topo}(G(C))$.
For all $v \in \text{memOps}(C)$ such that $R(v, \ell) \vee W(v, \ell)$, the last writer of $v$ according to $S$, denoted
$\mathcal{L}_S(v)$, is the unique $u \in \text{memOps}(C) \cup \{\text{begin}(C)\}$ that satisfies three conditions:
1. $W(u, \ell)$,
2. $u <_S v$, and
3. $\neg \exists w$ s.t. $W(w, \ell) \wedge (u <_S w <_S v)$.
In other words, if vertex $v$ accesses (reads or writes) location $\ell$, the last writer of $v$ is the last vertex
$u$ before $v$ in the order $S$ that writes to location $\ell$.
Figure 6.2: Examples of sequential consistency for a computation $C$ with only committed transactions.
Shown is the computation dag $G(C)$. For the observer function $\Phi_1$ given by
$(\Phi_1(1) = 0, \Phi_1(2) = 1, \Phi_1(3) = 0, \Phi_1(4) = 0, \Phi_1(5) = 2)$, the trace $(C, \Phi_1)$ is sequentially consistent,
with the topological sort $S = (0, 1, 2, 3, 4, 5)$ of $G(C)$. For the observer function $\Phi_2$ given by
$(\Phi_2(1) = 0, \Phi_2(2) = 1, \Phi_2(3) = 0, \Phi_2(4) = 0, \Phi_2(5) = 1)$, however, the trace $(C, \Phi_2)$ is not sequentially
consistent, because there is no topological sort consistent with the last-writer function.
¹ We assume that every leaf $x \in \text{memOps}(C)$ is its own committed, closed transaction, but we do not mark leaves as
transactions in our model.
² Our definition of $\Phi$ is similar to Frigo's [Fri98], but with a salient difference, namely, Frigo's observer function
gives values for all memory locations, not just for the location that a vertex accesses. Moreover, if $W(v, \ell)$, Frigo
defines $\Phi(v) = v$, whereas we define $\Phi(v) = u$ for some $u \neq v$.
We can use the last-writer function to define sequential consistency for computations containing
no transactions.
Definition 5 Sequential consistency for computations without transactions is the memory model
$$\text{SC} = \{(C, \Phi) : \exists S \in \text{topo}(G(C)) \text{ s.t. } \Phi = \mathcal{L}_S\}.$$
By this definition, a trace $(C, \Phi)$ with no transactions is sequentially consistent if there exists a
topological sort $S$ of $G(C)$ such that the observer function $\Phi$ satisfies $\Phi(v) = \mathcal{L}_S(v)$ for all memory
operations $v \in \text{memOps}(C)$. Definition 5 captures Lamport's notion [Lam79] of sequential
consistency: there exists a single order on all operations that explains the execution of the program.
Figure 6.2 shows a sample computation dag $G(C)$ and two possible observer functions, $\Phi_1$ and $\Phi_2$.
The trace $(C, \Phi_1)$ is sequentially consistent, but $(C, \Phi_2)$ is not.
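Definitions 4 and 5 can be checked by brute force on small traces. The sketch below is a simplification under stated assumptions: it keeps only the memory-operation vertices plus a pseudo-vertex `BEGIN` standing in for $\text{begin}(C)$, it takes the precedence constraints among memory operations as an explicit edge set, and the names (`last_writer`, `sequentially_consistent`) and the two-thread example are hypothetical. It does not reproduce the computation of Figure 6.2.

```python
from itertools import permutations

BEGIN = "begin(C)"   # assumed pseudo-vertex that writes initial values everywhere

def last_writer(order, v, loc, writes):
    """Last vertex before v in `order` that writes `loc` (Definition 4)."""
    i = order.index(v)
    for u in reversed(order[:i]):
        if u == BEGIN or writes.get(u) == loc:
            return u
    raise ValueError("order must start with BEGIN")

def sequentially_consistent(ops, edges, phi):
    """ops: vertex -> (kind, location), kind in {"R", "W"};
    edges: precedence pairs among memory operations;
    phi: observer function as a dict. Tries every topological sort."""
    writes = {v: loc for v, (kind, loc) in ops.items() if kind == "W"}
    for perm in permutations(ops):
        order = [BEGIN, *perm]
        pos = {v: i for i, v in enumerate(order)}
        if any(pos[u] > pos[v] for u, v in edges):
            continue                       # not a topological sort
        if all(last_writer(order, v, ops[v][1], writes) == phi[v] for v in ops):
            return True                    # phi equals the last-writer function
    return False

# Two parallel strands: (w1: write x; r1: read y) and (w2: write y; r2: read x)
ops = {"w1": ("W", "x"), "r1": ("R", "y"), "w2": ("W", "y"), "r2": ("R", "x")}
edges = {("w1", "r1"), ("w2", "r2")}
phi_ok = {"w1": BEGIN, "w2": BEGIN, "r1": "w2", "r2": "w1"}
phi_bad = {"w1": BEGIN, "w2": BEGIN, "r1": BEGIN, "r2": BEGIN}
ok = sequentially_consistent(ops, edges, phi_ok)
bad = sequentially_consistent(ops, edges, phi_bad)
```

With `phi_bad`, both reads observe the initial values even though each follows a write in the other strand, so no single order explains the trace and the check fails.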
Transaction Contents
The computation tree $C$ also specifies a set $\text{committed}(C) \subseteq \text{xactions}(C)$ of committed transactions.
Similarly, transactions belonging to $\text{aborted}(C) = \text{xactions}(C) - \text{committed}(C)$ are
aborted transactions. First, we define some notation to classify committed transactions as either
closed or open. Define the set of closed, committed transactions as
$$\text{cCom}(C) = \text{committed}(C) - \text{open}(C),$$
and the set of open, committed transactions as
$$\text{oCom}(C) = \text{committed}(C) \cap \text{open}(C).$$
To specify content sets, we require the following definition.
Definition 6 For any $T \in \text{xactions}(C)$ and a memory operation $u \in \text{memOps}(T)$, let
$$Q(u, T) = \text{xAnces}(u) - \text{ances}(T).$$
Define $q(u, T)$ as the $T^* \in Q(u, T)$ which is the closest ancestor of $u$ such that $T^* \notin \text{cCom}(C)$, or
null if no such $T^*$ exists. In other words, if $T^*$ exists, then we have
$$(\text{xAnces}(u) - \text{ances}(T^*)) \subseteq \text{cCom}(C)$$
but $T^* \notin \text{cCom}(C)$.
Using $q(u, T)$, we define the content sets as follows.
Definition 7 At a time $t$, for any $T \in \text{xactions}(C)$, define the closed content set
$$\text{cContent}(T) = \{u \in \text{memOps}(C) : q(u, T) = \text{null}\}.$$
Definition 8 At a time $t$, for any $T \in \text{xactions}(C)$, define the open content set
$$\text{oContent}(T) = \{u \in \text{memOps}(C) : q(u, T) \in \text{oCom}(C)\}.$$
Definition 9 At a time $t$, for any $T \in \text{xactions}(C)$, define the aborted content set
$$\text{aContent}(T) = \{u \in \text{memOps}(C) : q(u, T) \in \text{aborted}(C)\}.$$
The intuition behind the closed content set for a transaction $T$ is that a memory operation
$u \in \text{memOps}(T)$ belongs to $T$'s closed content set if every transaction in $\text{ances}(u)$ up to, but not
including $T$, is a closed and committed transaction. This case occurs only if $q(u, T)$ is null. Since
each instruction belongs to one of the three content sets, we can say
$$\text{cContent}(T) = V(T) - \bigcup_{Z \in \text{open}(T) - \{T\}} V(Z) - \bigcup_{Z \in \text{aborted}(T) - \{T\}} V(Z).$$
We always have $\text{cContent}(T) \subseteq V(T)$, and equality holds when $T$'s subtree contains no open or
aborted transactions.³ For example, in Figure 6.1, memory operations $u_1$ and $u_2$ do not belong to
$\text{cContent}(T_1)$, because $T_2$ is an open transaction nested within $T_1$. As another example from the
figure, we have $v_2 \in \text{cContent}(T_4)$ if and only if $T_5 \in \text{committed}(C)$.
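The content sets can be computed by walking up the tree from each leaf. The sketch below is one possible reading of Definitions 6 through 9, with assumptions labeled: the `Node` class and its flags are hypothetical, the sets are restricted to the leaves inside $T$ (where $q(u, T)$ is defined), and the two small trees only loosely mirror the situations described for Figure 6.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    xaction: bool = False          # marked as a transaction
    is_open: bool = False          # open (True) or closed (False) transaction
    committed: bool = True         # committed (True) or aborted (False)
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self):
        for c in self.children:
            c.parent = self

def leaves(x):
    if not x.children:
        return [x]
    return [leaf for c in x.children for leaf in leaves(c)]

def q(u, T):
    """Closest transactional ancestor of u strictly below T that is not
    closed-and-committed, or None if there is none (Definition 6)."""
    x = u.parent
    while x is not None and x is not T:
        if x.xaction and (x.is_open or not x.committed):
            return x
        x = x.parent
    return None

def c_content(T):
    return {u.name for u in leaves(T) if q(u, T) is None}

def o_content(T):
    return {u.name for u in leaves(T)
            if q(u, T) is not None and q(u, T).committed and q(u, T).is_open}

def a_content(T):
    return {u.name for u in leaves(T)
            if q(u, T) is not None and not q(u, T).committed}

# Hypothetical trees: T2 is open-nested in T1; T5 is aborted inside T4.
T2 = Node("T2", [Node("u1")], xaction=True, is_open=True)
T1 = Node("T1", [Node("a"), T2], xaction=True)
T5 = Node("T5", [Node("v2")], xaction=True, committed=False)
T4 = Node("T4", [Node("v1"), T5], xaction=True)
```

Here `u1` falls in the open content of `T1` rather than its closed content, and `v2` falls in the aborted content of `T4` while still belonging to the closed content of `T5` itself.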
We also define the holders of a vertex $v \in V(C)$ to be the set
$$h(v) = \{T \in \text{xactions}(C) : v \in \text{cContent}(T)\}$$
of all transactions that contain $v$.
³ We consider only global open nesting, meaning that if $T'$ is open-nested in $T$, then it is open with respect to every
transaction in $\text{ances}(T)$. Alternatively, one might specify $T'$ as open-nested with respect to an ancestor transaction $T$.
In this case, the operations of $T'$ are excluded from all transactions $T''$ on the path from $T'$ up to and including $T$,
but included in transactions that are proper ancestors of $T$. Intuitively, if $T'$ is open-nested with respect to $T$, then
$T'$ commits its changes to $T$'s context rather than directly to memory. Global open nesting is then the special case
when all open transactions are open with respect to $\text{root}(C)$. As far as we know, there are no implementations of
non-global open nesting.
Hidden vertices
Basic transactional semantics dictate that committed transactions should not "see" values written
by vertices belonging to the content of an aborted transaction. One may argue whether one aborted
transaction should be able to see values written by another aborted transaction. We take the
position that up to the point that a transaction aborts, it should be "well behaved" and act as if it
would commit. The well-behavedness of aborted transactions is implicitly assumed by the various
proposals for open nesting [MCC+06, MH05, Mos06]. Thus, one aborted transaction should not
see values written by other aborted transactions, although the values written by a vertex within an
aborted transaction may be seen by other vertices within the same transaction.
The following definition describes which vertices are hidden from which other vertices.
Definition 10 For any two vertices $u, v \in V(C)$, let $X = \text{LCA}(u, v)$. We say that $u$ is hidden from
$v$, denoted $u H v$, if
$$u \in \bigcup_{Y \in \text{aborted}(X) - \{X\}} \text{cContent}(Y).$$
In Figure 6.1, we have $v_2 H z_2$ if and only if at least one of $T_1$, $T_4$, or $T_5$ belongs to $\text{aborted}(C)$.
Since $T_2$ is an open transaction, however, we never have $u_1 H z_2$ if $T_2, T_3 \in \text{committed}(C)$, even if
$T_1 \in \text{aborted}(C)$. If we have $T_1, T_4 \in \text{committed}(C)$ and $T_7 \in \text{aborted}(C)$, then we also
have $y_1 H v_1$, but not $v_1 H y_1$, and thus the hidden relation $H$ is not symmetric.
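The hidden relation of Definition 10 can be evaluated by walking from $u$ up to, but not including, $\text{LCA}(u, v)$. The following sketch makes several assumptions: the `Node` class and flags are hypothetical, only leaves are passed as arguments, and the example tree is invented rather than taken from Figure 6.1. The walk tracks whether every transaction passed so far is closed and committed, which is the condition for $u$ to lie in the closed content of the next transaction up.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(eq=False)
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    xaction: bool = False
    is_open: bool = False
    committed: bool = True
    parent: Optional["Node"] = field(default=None, repr=False)

    def __post_init__(self):
        for c in self.children:
            c.parent = self

def ancestors(x):
    """x, its parent, ..., up to the root."""
    out = []
    while x is not None:
        out.append(x)
        x = x.parent
    return out

def lca(u, v):
    above_v = {id(a) for a in ancestors(v)}
    return next(a for a in ancestors(u) if id(a) in above_v)

def hidden(u, v):
    """True if u is hidden from v: u lies in the closed content of some
    aborted transaction strictly below LCA(u, v)."""
    X = lca(u, v)
    if u is X:
        return False
    clean = True                       # all transactions passed so far are closed and committed
    for y in ancestors(u)[1:]:
        if y is X:
            break
        if y.xaction and not y.committed and clean:
            return True                # u is in cContent(y) and y aborted
        if y.xaction and (y.is_open or not y.committed):
            clean = False
    return False

# Hypothetical tree: aborted T1 holds leaf a and open, committed T2 (leaf u1);
# committed T7 holds leaf z.
a, u1, z = Node("a"), Node("u1"), Node("z")
T2 = Node("T2", [u1], xaction=True, is_open=True)
T1 = Node("T1", [a, T2], xaction=True, committed=False)
T7 = Node("T7", [z], xaction=True)
root = Node("root", [T1, T7])
```

In this example `a` is hidden from `z`, but `z` is not hidden from `a`, which illustrates that the relation is not symmetric. The open transaction `T2` shields `u1` from the abort of `T1`, and `a` remains visible to `u1` inside the same aborted transaction.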
Transactional sequential consistency
We now extend the definition of sequential consistency to account for transactions. Our definition
does not attempt to model atomicity, however - that is the topic of Section 6.4. It simply models
that a transaction outside an aborted transaction cannot "see" values written by the aborted transac-
tion. Moreover, our definition makes the assumption that an aborted computation is consistent up
to the point that it aborts.
We first redefine the last-writer function to take aborted transactions into account. Intuitively,
another transaction should not be able to "see" the values of an aborted transaction.
Definition 11 Consider a trace $(C, \Phi) \in U$ and a topological sort $S \in \text{topo}(G(C))$. For all
$v \in \text{memOps}(C)$ such that $R(v, \ell) \vee W(v, \ell)$, the transactional last writer of $v$ according to $S$,
denoted $\mathcal{X}_S(v)$, is the unique $u \in \text{memOps}(C) \cup \{\text{begin}(C)\}$ that satisfies four conditions:
1. $W(u, \ell)$,
2. $u <_S v$,
3. $\neg(u H v)$, and
4. $\forall w \; (W(w, \ell) \wedge (u <_S w <_S v)) \Rightarrow w H v$.
The first two conditions for the transactional last-writer function $\mathcal{X}$ are the same as for the last-writer
function $\mathcal{L}$. The third and fourth conditions of Definition 11 parallel the third condition of
Definition 4, except that now $v$ ignores vertices $u$ or $w$ that write to $\ell$ but which are hidden from $v$.
Sequential consistency can now be defined for computations that include transactions. The
definition is exactly like Definition 5, except that the last-writer function $\mathcal{L}_S$ is replaced by the
transactional last-writer function $\mathcal{X}_S$.
Definition 12 Transactional sequential consistency is the memory model
$$\text{TSC} = \{(C, \Phi) : \exists S \in \text{topo}(G(C)) \text{ s.t. } \Phi = \mathcal{X}_S\}.$$
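The transactional last writer differs from the ordinary last writer only in skipping writers that are hidden from the reader. A minimal sketch follows, assuming the hidden relation is supplied as a predicate `hidden(u, v)` and that a pseudo-vertex `BEGIN` (standing in for $\text{begin}(C)$) is never hidden; the names and the example are illustrative.

```python
BEGIN = "begin(C)"   # assumed pseudo-vertex writing initial values; never hidden

def transactional_last_writer(order, v, loc, writes, hidden):
    """Most recent writer of `loc` before v in `order` that is not hidden
    from v. Every later writer of `loc` before v is then hidden from v,
    which is the fourth condition of Definition 11."""
    i = order.index(v)
    for u in reversed(order[:i]):
        if (u == BEGIN or writes.get(u) == loc) and not hidden(u, v):
            return u
    raise ValueError("order must start with BEGIN")

# w1 and w2 both write x; suppose w2 belongs to an aborted transaction
# that does not contain the reader r, so w2 is hidden from r.
order = [BEGIN, "w1", "w2", "r"]
writes = {"w1": "x", "w2": "x"}
w2_hidden = lambda u, v: u == "w2" and v == "r"
nothing_hidden = lambda u, v: False
seen_with_abort = transactional_last_writer(order, "r", "x", writes, w2_hidden)
seen_without_abort = transactional_last_writer(order, "r", "x", writes, nothing_hidden)
```

When nothing is hidden, the function coincides with the last writer of Definition 4.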
6.4 Transactional Memory Models
In this section, we use our framework to define three different transactional memory models: seri-
alizability, race freedom, and prefix-race freedom. The intuition behind all three memory models
is to find a single linear order S on all operations that both "explains" all memory operations and
provides guarantees about every transaction. Serializability requires that all transactions appear
as contiguous in S. Race freedom weakens serializability by allowing transactions that do not
"conflict" to interleave their memory operations in S. Finally, prefix-race freedom weakens race
freedom by only prohibiting conflicts with the prefix of a transaction.
Serializability
Serializability [Pap79] is the standard correctness condition for transactional systems.
Definition 13 The serializability transactional memory model, ST, is the set of all traces $(C, \Phi) \in U$
for which there exists a topological sort $S \in \text{topo}(G(C))$ that satisfies two conditions:
1. $\Phi = \mathcal{X}_S$, and
2. $\forall T \in \text{xactions}(C)$ and $\forall v \in V(C)$, we have $\text{begin}(T) \le_S v \le_S \text{end}(T)$ implies
$v \in V(T)$.
Informally, an execution belongs to ST if there exists an ordering on all operations $S$ such that the
observer function $\Phi$ is the transactional last writer $\mathcal{X}_S$, and for every transaction $T$, the vertices in
$V(T)$ appear contiguous in $S$.
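The second condition of Definition 13 amounts to a contiguity check. In a topological sort, $\text{begin}(T)$ precedes and $\text{end}(T)$ follows every other vertex of $V(T)$, so the condition holds exactly when $V(T)$ occupies consecutive positions. The sketch below assumes each vertex appears once in `order` and that the vertex set of each transaction is given explicitly; the names are hypothetical.

```python
def contiguous(order, members):
    """True if the vertices in `members` occupy consecutive positions in `order`."""
    idx = [i for i, v in enumerate(order) if v in members]
    return not idx or idx[-1] - idx[0] + 1 == len(idx)

def transactions_contiguous(order, transactions):
    """transactions: name -> vertex set V(T). Checks condition 2 for all of them."""
    return all(contiguous(order, vs) for vs in transactions.values())

transactions = {"T": {"begin(T)", "a", "b", "end(T)"}}
serial = ["begin(T)", "a", "b", "end(T)", "v"]
interleaved = ["begin(T)", "a", "v", "b", "end(T)"]
```

The `interleaved` order fails the check because `v`, which lies outside `T`, appears between `begin(T)` and `end(T)`.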
Race freedom
Our definition of race freedom is motivated by the observation that actual TM implementations
allow independent transactions to interleave their executions provided that one transaction does not
try to write to a memory location accessed by the other transaction. Normally, with only closed-
nested transactions and ignoring operations from aborted transactions, we expect to be able to
rearrange any interleaved execution order allowed by race freedom into an equivalent serializable
order. As we shall see in Section 6.5, the two models are indeed equivalent for computations having
only closed and committed transactions. With aborted and open transactions in the model, however,
we shall discover that the models are distinct.
To define race freedom, we first describe what it means to have a transactional race between a
memory operation and a transaction with respect to a topological sort of the computation dag.
Definition 14 Let $C$ be a computation tree, and suppose that $S \in \text{topo}(G(C))$ is a topological
sort of $G(C)$. A (transactional) race with respect to $S$ occurs between $v \in V(C)$ and
$T \in \text{xactions}(C)$, denoted by the predicate $\text{RACE}_S(v, T)$, if $v \notin V(T)$ and there exists a
$w \in \text{cContent}(T)$ satisfying the following conditions:
1. $\neg(v H w)$,
2. $\exists \ell \in M$ s.t. $(R(v, \ell) \wedge W(w, \ell)) \vee (W(v, \ell) \wedge R(w, \ell)) \vee (W(v, \ell) \wedge W(w, \ell))$, and
3. $\text{begin}(T) <_S v <_S \text{end}(T)$.
The notion of a race is easier to understand when all transactions are committed, in which case
no vertices are hidden from each other. Intuitively, a race occurs between transaction $T$ and a
vertex $v \notin V(T)$ appearing between $\text{begin}(T)$ and $\text{end}(T)$ in $S$ if $v$ "conflicts" with some vertex
$u \in \text{cContent}(T)$, where by "conflicts," we mean that $v$ writes to a location that $u$ reads or writes,
or vice versa.
We can now define race freedom.
Definition 15 The race-free transactional memory model RFT is the set of all traces $(C, \Phi) \in U$
for which there exists a topological sort $S \in \text{topo}(G(C))$ satisfying two conditions:
1. $\Phi = \mathcal{X}_S$, and
2. $\forall v \in V(C)$ and $\forall T \in \text{xactions}(C)$, $\neg \text{RACE}_S(v, T)$.
The first condition of race freedom is the same as for serializability, that the observer function is the
transactional last writer. The second condition allows an operation v to appear between begin(T)
and end(T) in S, but only provided no race between v and T exists.
Prefix-race freedom
The notion of a prefix-race is motivated by the operational semantics of TM systems. As two
transactions $T$ and $T'$ execute, if $T'$ discovers a memory-access conflict between a vertex $v \in T'$
and T, then the conflict must be with a vertex in T that has already executed, that is, the prefix of
T that executes before v. For prefix-race freedom, no such conflicts may occur.
Definition 16 Let $C$ be a computation tree, and let $S \in \text{topo}(G(C))$ be a topological sort of $G(C)$.
A (transactional) prefix-race with respect to $S$ occurs between $v \in V(C)$ and $T \in \text{xactions}(C)$,
denoted by the predicate $\text{PRACE}_S(v, T)$, if $v \notin V(T)$ and there exists a $w \in \text{cContent}(T)$ satisfying
the following conditions:
1. $\neg(v H w)$,
2. $\exists \ell \in M$ s.t. $(R(v, \ell) \wedge W(w, \ell)) \vee (W(v, \ell) \wedge R(w, \ell)) \vee (W(v, \ell) \wedge W(w, \ell))$, and
3. $\text{begin}(T) \le_S w <_S v <_S \text{end}(T)$.
Thus, this definition is identical to Definition 14, except that the potential conflicting vertex w must
occur before v in S.
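The difference between Definitions 14 and 16 can be seen in a few lines of code. The sketch below assumes a transaction is described by a dictionary holding its vertex set, closed content, and begin/end vertex names, that `ops` maps each memory operation to a `(kind, location)` pair, and that the hidden relation defaults to "nothing is hidden" (all transactions committed). All names are illustrative.

```python
def conflicts(a, b, ops):
    """Same location and at least one write; non-memory vertices never conflict."""
    if a not in ops or b not in ops:
        return False
    (ka, la), (kb, lb) = ops[a], ops[b]
    return la == lb and "W" in (ka, kb)

def race(order, v, T, ops, hidden=lambda a, b: False, prefix_only=False):
    """RACE_S(v, T) when prefix_only is False, PRACE_S(v, T) when True."""
    pos = {x: i for i, x in enumerate(order)}
    if v in T["V"] or not (pos[T["begin"]] < pos[v] < pos[T["end"]]):
        return False
    for w in T["cContent"]:
        if prefix_only and not pos[w] < pos[v]:
            continue                   # only the prefix of T before v counts
        if not hidden(v, w) and conflicts(v, w, ops):
            return True
    return False

# v interleaves inside T and conflicts only with b, which runs after v.
order = ["begin(T)", "a", "v", "b", "end(T)"]
T = {"V": {"begin(T)", "a", "b", "end(T)"}, "cContent": {"a", "b"},
     "begin": "begin(T)", "end": "end(T)"}
ops = {"a": ("R", "x"), "b": ("W", "y"), "v": ("W", "y")}
has_race = race(order, "v", T, ops)
has_prefix_race = race(order, "v", T, ops, prefix_only=True)
```

This order contains a race but no prefix-race, since the only conflicting vertex of `T` appears after `v`.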
The notion of a prefix-race gives rise to a corresponding memory model in which prefix-races
are absent.
Definition 17 The prefix-race-free transactional memory model PRFT is the set of all traces
$(C, \Phi) \in U$ for which there exists a topological sort $S \in \text{topo}(G(C))$ satisfying two conditions:
1. $\Phi = \mathcal{X}_S$, and
2. $\forall v \in V(C)$ and $\forall T \in \text{xactions}(C)$, $\neg \text{PRACE}_S(v, T)$.
Thus, prefix-race freedom describes a weaker model than race freedom, where a vertex $v$ is
only guaranteed not to conflict with the vertices of transaction $T$ that appear before $v$ in $S$. If a
"nontransactional" leaf node $v \in \text{memOps}(C)$ runs in parallel with a transaction $T$, all of Definitions
13, 15, and 17 check whether $v$ interleaves within $T$'s execution. Thus, these models can be thought
of as guaranteeing "strong atomicity" in the parlance of Blundell, Lewis, and Martin [BLM05].
In Scott's model [Sco06], $\text{RACE}_S(v, T)$ and $\text{PRACE}_S(v, T)$ can be viewed as particular "conflict
functions."
Relationships among the models
The following theorem shows that the memory models as presented are progressively weaker.
Theorem 6.1 $\text{ST} \subseteq \text{RFT} \subseteq \text{PRFT}$.
PROOF. Follows directly from Definitions 13, 15, and 17. □
For computations with only closed and committed transactions, prefix-race freedom and seri-
alizability are equivalent, as we shall see in Section 6.5. When open and aborted transactions are
considered, all three models are distinct.
6.5 Distinctness of the Models
In this section, we study the memory models of serializability, prefix-race freedom, and race free-
dom. Specifically, we show that for computations containing only committed and closed transac-
tions, all three models are equivalent. We also demonstrate that when aborted and/or open transac-
tions are allowed, all three models are distinct.
Since these models are distinct under certain conditions and not distinct under others, we define
certain special sets of traces. Definition 3 states that a memory model A is a subset of U, the
universe of all possible traces. Sometimes, we wish to restrict our attention to computations with
only closed and/or committed transactions. Thus, we define the following subsets of U:
$$U_0 = \{(C, \Phi) \in U : \text{xactions}(C) = \emptyset\},$$
$$U_{clo} = \{(C, \Phi) \in U : \text{open}(C) = \emptyset\},$$
$$U_{com} = \{(C, \Phi) \in U : \text{aborted}(C) = \emptyset\},$$
$$U_{c\&c} = U_{clo} \cap U_{com}.$$
In other words, $U_0$ contains traces whose computations include no transactions, $U_{clo}$ contains
traces that include only closed transactions, $U_{com}$ contains traces that include only committed transactions,
and $U_{c\&c}$ contains traces that include only committed and closed transactions.
Dependency graphs
Before addressing the distinctness of the memory models directly, we first present an alternative
characterization of sequential consistency for the special case of computations with only committed
transactions. The idea of a "dependency" graph is to add edges to the computation dag to reflect
the dependencies imposed by the observer function.
Definition 18 The set of dependency edges of a trace $(C, \Phi) \in U_{com}$ is
$\Psi_d(C, \Phi) = \{(u, v) \in V(C) \times V(C) : u = \Phi(v)\}$,
and the set of antidependency edges is $\Psi_a(C, \Phi) = \{(u, v) \in V(C) \times V(C) : (\Phi(u) = \Phi(v)) \wedge
W(v, \ell)\}$. The dependency graph of $(C, \Phi)$ is the graph $\mathcal{DG}(C, \Phi) = (V, E)$, where $V = V(C)$
and $E = E(C) \cup \Psi_d(C, \Phi) \cup \Psi_a(C, \Phi)$.
The sets $\Psi_d$ and $\Psi_a$ capture the usual notions of dependency and antidependency edges from the
study of compilers [KKP+81]. A dependency edge $(u, v)$ indicates that $v$ observed the value written
by $u$. An antidependency edge $(u, v)$ means that if both $u$ and $v$ observe the same write to a location
$\ell$, and if $v$ performs a write, then $u$ must "come before" $v$.
The following lemma, presented without proof, shows that in the universe of all traces with
only committed transactions, a trace $(C, \Phi)$ is sequentially consistent if and only if the dependency
graph $\mathcal{DG}(C, \Phi)$ is acyclic.⁴
Lemma 6.2 Suppose that $(C, \Phi) \in U_{com}$. Then, we have $(C, \Phi) \in \text{SC}$ if and only if the dependency
graph $\mathcal{DG}(C, \Phi)$ is acyclic. □
Figure 6.3 shows the dependency graphs for the example traces from Figure 6.2. Whereas the
trace $(C, \Phi_1)$ is sequentially consistent, the trace $(C, \Phi_2)$ is not. Equivalently by Lemma 6.2, the
dependency graph $\mathcal{DG}(C, \Phi_1)$ is acyclic, but the graph $\mathcal{DG}(C, \Phi_2)$ is not.
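The acyclicity test suggested by Lemma 6.2 can be sketched as follows. Several choices here are assumptions rather than part of the definition: only memory operations and a pseudo-vertex `BEGIN` (for $\text{begin}(C)$) are modeled, control edges are given directly among memory operations, and antidependency edges are added only between distinct vertices that access the same location, which is how the sketch reads "observe the same write to a location $\ell$". The two-strand example is hypothetical and is not the computation of Figure 6.2.

```python
BEGIN = "begin(C)"   # assumed pseudo-vertex that writes initial values

def dependency_graph(ops, control_edges, phi):
    """Control edges plus dependency and antidependency edges."""
    edges = set(control_edges)
    for v in ops:
        edges.add((phi[v], v))                      # dependency: v observes phi(v)
    for u in ops:
        for v in ops:
            same_loc = ops[u][1] == ops[v][1]       # assumption: same location only
            if u != v and same_loc and phi[u] == phi[v] and ops[v][0] == "W":
                edges.add((u, v))                   # antidependency: u before the overwrite v
    return edges

def acyclic(vertices, edges):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be removed."""
    indeg = {v: 0 for v in vertices}
    succ = {v: [] for v in vertices}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while ready:
        u = ready.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return removed == len(vertices)

# Two parallel strands: (w1: write x; r1: read y) and (w2: write y; r2: read x)
ops = {"w1": ("W", "x"), "r1": ("R", "y"), "w2": ("W", "y"), "r2": ("R", "x")}
control = {("w1", "r1"), ("w2", "r2")}
vertices = set(ops) | {BEGIN}
phi_ok = {"w1": BEGIN, "w2": BEGIN, "r1": "w2", "r2": "w1"}
phi_bad = {"w1": BEGIN, "w2": BEGIN, "r1": BEGIN, "r2": BEGIN}
ok_acyclic = acyclic(vertices, dependency_graph(ops, control, phi_ok))
bad_acyclic = acyclic(vertices, dependency_graph(ops, control, phi_bad))
```

With `phi_bad`, the antidependency edges `(r1, w2)` and `(r2, w1)` close the cycle `w1, r1, w2, r2, w1`, so the graph is cyclic, in agreement with the trace not being sequentially consistent.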
We can now prove the equivalence of serializability, race freedom, and prefix-race freedom
when we consider only computations with committed and closed transactions.
Theorem 6.3 $\text{ST} \cap U_{c\&c} = \text{RFT} \cap U_{c\&c} = \text{PRFT} \cap U_{c\&c}$.
⁴ One must extend the definition of an antidependency edge to prove an analogous result when the computation $C$
has aborted transactions. Lemma 6.2 does not hold without the assumption that every write to a location also performs
a read.
Figure 6.3: Dependency graphs $\mathcal{DG}(C, \Phi_1)$ and $\mathcal{DG}(C, \Phi_2)$ for the traces from Figure 6.2, drawn with
control edges, dependency edges, and antidependency edges. Since $(C, \Phi_1) \in \text{SC}$, the graph $\mathcal{DG}(C, \Phi_1)$ is
acyclic, but since $(C, \Phi_2) \notin \text{SC}$, the graph $\mathcal{DG}(C, \Phi_2)$ contains a cycle,
namely (2, 3, 4, 5, 2).
PROOF. Since Theorem 6.1 shows that $\text{ST} \subseteq \text{RFT} \subseteq \text{PRFT}$, it suffices to prove that
$\text{PRFT} \cap U_{c\&c} \subseteq \text{ST} \cap U_{c\&c}$.
We start by defining some terminology. For $u, v \in V(C)$, define the alternation count of $u$ and
$v$ as
$$A(u, v) = |h(u)| + |h(v)| - 2\,|h(\text{LCA}(u, v))|.$$
(The holders function $h$ was defined in Section 6.3.) Thus, $A(u, v)$ counts the number of transactions
$T \in \text{xactions}(C)$ that contain either $u$ or $v$, but not both. For any topological sort $S$ of
$G(C)$, define the alternation count of $S$, denoted $\text{alt}(S)$, as the sum of all $A(u, v)$ for consecutive
$u$ and $v$ in $S$. Intuitively, $\text{alt}(S)$ counts the number of times we "switch" between transactions as
we run through $S$.
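The alternation count is easy to compute once the holders of each vertex are known. The sketch below follows the prose reading of $A(u, v)$ as the number of transactions that contain exactly one of $u$ and $v$, which for holder sets is the size of their symmetric difference. The dictionary of holder sets and the vertex names are hypothetical.

```python
def alternation(u, v, holders):
    """A(u, v): number of transactions holding exactly one of u and v."""
    return len(holders[u] ^ holders[v])

def alt(order, holders):
    """alt(S): sum of A(u, v) over consecutive u, v in the order S."""
    return sum(alternation(u, v, holders) for u, v in zip(order, order[1:]))

# A single transaction T holds a1 and a3; t and v1 lie outside it.
holders = {"t": set(), "a1": {"T"}, "v1": set(), "a3": {"T"}}
S = ["t", "a1", "v1", "a3"]          # v1 interleaves inside T
S_prime = ["t", "v1", "a1", "a3"]    # v1 moved ahead of T's fragment
```

Moving the interleaved vertex ahead of the transaction's first fragment lowers the alternation count from 3 to 1 in this example, which is the kind of reduction the proof relies on.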
We prove by contradiction that for any trace $(C, \Phi) \in U_{c\&c}$, we have $(C, \Phi) \in \text{PRFT}$ implies
$(C, \Phi) \in \text{ST}$. Suppose that a trace $(C, \Phi) \in U_{c\&c}$ exists that is prefix-race free but not serializable.
Consider any prefix-race-free topological sort $S \in \text{topo}(\mathcal{DG}(C, \Phi))$ that has a minimum
alternation count $\text{alt}(S)$ over all sorts in $\text{topo}(\mathcal{DG}(C, \Phi))$. By Lemma 6.2, $S$ satisfies the condition
$\Phi = \mathcal{X}_S$ (the first condition for all three transactional models).
Since $(C, \Phi) \notin \text{ST}$, some transaction $T$ exists that is not contiguous in $S$ (and therefore violates
the second condition in Definition 13). Let $T$ be such a transaction, and let $v_1$ be the first vertex
such that $v_1 \notin V(T)$ and $\text{begin}(T) <_S v_1 <_S \text{end}(T)$. Choose vertices
$t <_S u_1 \le_S u_2 <_S v_1 \le_S v_2 <_S w_1 \le_S w_2$, such that $u_1 = \text{begin}(T)$ as shown in Figure 6.4(a).
Define the sets $A_1$, $A_2$, and $A_3$ as follows:
$$A_1 = \{x \in V(T) : u_1 \le_S x \le_S u_2\},$$
$$A_2 = \{x \in V(C) - V(T) : v_1 \le_S x \le_S v_2\}, \text{ and}$$
$$A_3 = \{x \in V(T) : w_1 \le_S x \le_S w_2\}.$$
The vertices of $A_1$ and $A_3$ all belong to $V(T)$, and $A_2$ is the set
interleaved between the contiguous fragments of $T$.
Figure 6.4: Two topological sorts of a computation graph $G(C)$ for a hypothetical trace $(C, \Phi)$ which is
prefix-race free, but not serializable. Transaction $T$ is not contiguous in the topological sort $S$ in (a). One
can convert $S$ into the topological sort $S'$ in (b). Doing so reduces the alternation count.
[Order in (a): $t$, $A_1$ (from $\text{begin}(T) = u_1$ to $u_2$), $A_2$ (from $v_1$ to $v_2$), $A_3$ (from $w_1$ to $w_2$), $\text{end}(T)$.
Order in (b): $t$, $A_2$, $A_1$, $A_3$, $\text{end}(T)$.]
From $S$, we construct the new order $S'$ shown in Figure 6.4(b) in which the intervals $A_1$ and
$A_2$ are interchanged. We shall show that (1) $S' \in \mathrm{topo}(D_G(C, \Phi))$ (and therefore $\Phi = \mathcal{X}_{S'}$), (2) $S'$
is still a prefix-race-free topological sort of $D_G(C, \Phi)$, and (3) $\mathrm{alt}(S') < \mathrm{alt}(S)$, thereby obtaining
the contradiction that $S$ is not a prefix-race-free topological sort with minimum alternation count.
To prove these three facts, we shall use a "nonconflicting" property: no pair of vertices $y \in A_1$
and $z \in A_2$ exists such that $y$ and $z$ access the same memory location and one of them is a
write. Otherwise we would have $\mathrm{PRACE}_S(z, T)$ by definition, because $y \in \mathrm{cContent}(T)$, $z \notin V(T)$, and
$\mathrm{begin}(T) <_S y <_S z <_S \mathrm{end}(T)$. Thus, $A_1$ and $A_2$ do not perform "conflicting" accesses to
memory.
To establish (1), that $S' \in \mathrm{topo}(D_G(C, \Phi))$, we show that for any $y \in A_1$ and $z \in A_2$, no
edge $(y, z)$ belongs to the graph $D_G(C, \Phi)$. If we have $(y, z) \in T_d(C, \Phi) \cup T_a(C, \Phi)$, then $y$
and $z$ access the same memory location and one of those accesses is a write, contradicting the
nonconflicting property above. Alternatively, if we have $(y, z) \in E(C)$, then $\mathrm{LCA}(y, z)$ must be an
S-node with $y$ to the left of $z$. Since $z \notin V(T)$, we have that $\mathrm{LCA}(T, z)$ $(= \mathrm{LCA}(y, z))$ is an S-node, and
thus we have $\mathrm{end}(T) \prec z$. But $z <_S \mathrm{end}(T)$, so $S$ would not be a valid sort of $D_G(C, \Phi)$. Hence
$(y, z) \notin E(C)$.
To establish (2), that $S'$ is prefix-race free, we show that swapping $A_1$ and $A_2$ cannot introduce
any prefix races that were not already present in $S$. Suppose that there is a prefix race in $S'$. Then
there must exist a $v \in V(C)$ and a transaction $T_1 \in \mathrm{xactions}(C)$ satisfying all three conditions
of Definition 16 for $S'$. Let $w \in \mathrm{cContent}(T_1)$ be the candidate vertex that satisfies the three
conditions. In particular, the third condition gives us $\mathrm{begin}(T_1) <_{S'} w <_{S'} v <_{S'} \mathrm{end}(T_1)$. We
consider two cases, each of which leads to a contradiction.
In the first case, suppose that $v <_S w$. Since $v$ and $w$ swap in the two orders, we must have
$v \in A_1$ and $w \in A_2$. But then they conflict by the second condition of Definition 16, which cannot
occur because of the nonconflicting property above.
In the second case, suppose that $w <_S v$. Since there is no prefix race in $S$, the only situation
in which this can happen is when $v$ falls entirely outside transaction $T_1$ in $S$, which is to say that
$\mathrm{begin}(T_1) <_S w <_S \mathrm{end}(T_1) <_S v$. Since $\mathrm{end}(T_1)$ and $v$ swapped, we must have $\mathrm{end}(T_1) \in A_1$
and $v \in A_2$. Since $A_1 \subseteq \mathrm{cContent}(T)$, it follows that $\mathrm{end}(T_1) \in \mathrm{cContent}(T)$, and thus $T_1$
must be nested within $T$. Consequently, we have $w \in A_1$, which cannot occur because of the
nonconflicting property.
To establish (3), that $\mathrm{alt}(S') < \mathrm{alt}(S)$, let us examine the difference $\delta = \mathrm{alt}(S) - \mathrm{alt}(S')$ in
the alternation counts of $S$ and $S'$. The only terms that contribute to $\delta$ are at the boundaries of $A_1$
and $A_2$. We have that
\begin{align*}
\delta &= A(t, u_1) + A(u_2, v_1) + A(v_2, w_1) - A(t, v_1) - A(v_2, u_1) - A(u_2, w_1) \\
       &= 2 \bigl( |h(\mathrm{LCA}(t, v_1))| + |h(\mathrm{LCA}(v_2, u_1))| + |h(\mathrm{LCA}(u_2, w_1))| \\
       &\qquad - |h(\mathrm{LCA}(t, u_1))| - |h(\mathrm{LCA}(u_2, v_1))| - |h(\mathrm{LCA}(v_2, w_1))| \bigr) .
\end{align*}
By construction, we know that $\{u_1, u_2, w_1, w_2\} \subseteq V(T)$, whereas none of $t$, $v_1$, and $v_2$ have $T$
as an ancestor. For any $y \in V(T)$ and $z \notin V(T)$, we have $\mathrm{LCA}(y, z) = \mathrm{LCA}(T, z)$, which yields
\[ \delta = 2 \bigl( |h(\mathrm{LCA}(t, v_1))| + |h(\mathrm{LCA}(u_2, w_1))| - |h(\mathrm{LCA}(t, T))| - |h(\mathrm{LCA}(T, v_1))| \bigr) . \]
Since $\mathrm{LCA}(u_2, w_1) \in \mathrm{desc}(T)$, we know $h(\mathrm{LCA}(u_2, w_1)) \supseteq h(T)$ and $|h(\mathrm{LCA}(u_2, w_1))| \ge |h(T)|$.
Since $t, v_1 \notin V(T)$, we have $h(\mathrm{LCA}(T, t)) \subset h(T)$ and $h(\mathrm{LCA}(T, v_1)) \subset h(T)$.⁵ Thus,
\[ |h(\mathrm{LCA}(u_2, w_1))| > \max \{ |h(\mathrm{LCA}(T, t))|, |h(\mathrm{LCA}(T, v_1))| \} , \]
and a similar algebra yields
\[ |h(\mathrm{LCA}(t, v_1))| \ge \min \{ |h(\mathrm{LCA}(T, t))|, |h(\mathrm{LCA}(T, v_1))| \} . \]
Consequently, we conclude that $\delta = \mathrm{alt}(S) - \mathrm{alt}(S') > 0$. □

⁵ In this case, we have a proper subset because $\mathrm{LCA}(T, t), \mathrm{LCA}(T, v_1) \in \mathrm{pAnces}(T)$ and we exclude $T$.

Aborted transactions
We now consider computations with aborted transactions. We are unaware of any prior work on
transactional semantics that explicitly models aborted transactions. The reason is simple: when
computations have only closed transactions, aborted transactions do not affect a program's output.
Since TM systems do not allow committed transactions to observe data directly from aborted
transactions, in most cases, vertices from aborted transactions are free to observe arbitrary values.⁶
In a system with open nesting, however, we must include aborted transactions in the memory
model if we wish to understand what happens when an open transaction commits but its parent
aborts. We contend that a reasonable transactional consistency model for open transactions must
not only model aborted transactions, but it should also guarantee that an aborted transaction T
is consistent up to the point it aborts. Otherwise, any open subtransactions within T may obtain
inconsistent values and still commit.
The next theorem shows that when aborted transactions are modeled, the three transactional
memory models are distinct.
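Theorem 6.4 $\mathrm{ST} \cap \mathcal{U}_{\mathrm{clo}} \subsetneq \mathrm{RFT} \cap \mathcal{U}_{\mathrm{clo}} \subsetneq \mathrm{PRFT} \cap \mathcal{U}_{\mathrm{clo}}$.
PROOF. Since Theorem 6.1 shows that $\mathrm{ST} \subseteq \mathrm{RFT} \subseteq \mathrm{PRFT}$, we need only show that
$\mathrm{ST} \cap \mathcal{U}_{\mathrm{clo}} \ne \mathrm{RFT} \cap \mathcal{U}_{\mathrm{clo}}$ and that $\mathrm{RFT} \cap \mathcal{U}_{\mathrm{clo}} \ne \mathrm{PRFT} \cap \mathcal{U}_{\mathrm{clo}}$.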
We first exhibit a computation that is race free but not serializable. Consider the computation
dag $G$ shown in Figure 6.5. Let $(C_1, \Phi_1)$ be the trace that generates $G$, where transactions $T_2$ and
$T_3$ abort but transaction $T_4$ commits. We shall show that $(C_1, \Phi_1) \in \mathrm{RFT}$, but $(C_1, \Phi_1) \notin \mathrm{ST}$.
If transaction $T_4$ commits, then for any topological sort $S$ satisfying $\mathcal{X}_S = \Phi_1$, we must have
$0 <_S 3 <_S 6 <_S 9$. Thus, $T_1$ cannot be contiguous within $S$, implying that $(C_1, \Phi_1) \notin \mathrm{ST}$.
We can show that $(C_1, \Phi_1)$ is race free, however. Let $S$ be $(0, 1, \ldots, 12)$. One can verify that $\Phi_1$
is indeed the transactional last-writer function according to $S$ (since $T_4$ commits, $\neg(6 \mathrel{H} 9)$, and thus
$\Phi_1(9) = \mathcal{X}_S(9)$). The only transactions that might violate the second condition of Definition 15 are
transactions that do not appear contiguous in $S$, in this case, only $T_1$. The only candidate vertex
$v$ for $\mathrm{RACE}_S(v, T_1)$ is $v = 6$. Since $T_2$ is an aborted subtransaction of $T_1$, however, neither 3 nor 9
belongs to $\mathrm{cContent}(T_1)$. Thus, picking $S = (0, 1, \ldots, 12)$ ensures that $T_1$ causes no races.
We next exhibit a computation that is prefix-race free but not race free. Consider $(C_2, \Phi_2)$ as
the trace generating the same computation dag $G$ from Figure 6.5, but this time with $T_2$ aborted
and $T_3$ and $T_4$ committed. We shall show that $(C_2, \Phi_2) \notin \mathrm{RFT}$, but that $(C_2, \Phi_2) \in \mathrm{PRFT}$.
To show that $(C_2, \Phi_2)$ is not race free, observe that in any topological sort $S \in \mathrm{topo}(G)$
for which $\Phi_2 = \mathcal{X}_S$, we must have $\mathrm{RACE}_S(6, T_1)$, since $\mathrm{begin}(T_1) <_S 6 <_S \mathrm{end}(T_1)$, vertices
6 and 9 access the same memory location $x$, vertex 6 is a write, and $\neg(6 \mathrel{H} 9)$. The order
$S = (0, 1, \ldots, 12)$ is prefix-race free, however, since $9 \not<_S 6$. The only transactions that might
violate the second condition of prefix-race freedom are those that do not appear contiguous in $S$,
in this case, only $T_1$. When we look at the vertex $v = 6$ that falls between $\mathrm{begin}(T_1)$ and $\mathrm{end}(T_1)$,
we only look at the prefix of $T_1$ before $v$ (vertices 1 through 4) for a prefix-race conflict, and there
is none.
The proof holds whether $T_1$ commits or aborts. □
Open transactions
We now study computations with open transactions but where all transactions commit. In this
context, the three models ST, RFT, and PRFT are distinct.
6 This intuition is not strictly true in a model that does not analyze an execution a posteriori, since control flow can
be affected by inconsistent data and prevent a program from terminating.
Figure 6.5: An example distinguishing the memory models. The transactions T2 and T3 are closed-nested
inside of T1. If transaction T4 commits, then this computation is not serializable, because T4 must interleave
inside of T1 . If both transactions T2 and T3 abort, then the execution is race free. If T2 aborts and T3
commits, then this execution is not race free, but it is prefix-race free.
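Theorem 6.5 $\mathrm{ST} \cap \mathcal{U}_{\mathrm{com}} \subsetneq \mathrm{RFT} \cap \mathcal{U}_{\mathrm{com}} \subsetneq \mathrm{PRFT} \cap \mathcal{U}_{\mathrm{com}}$.
PROOF. Since Theorem 6.1 shows that $\mathrm{ST} \subseteq \mathrm{RFT} \subseteq \mathrm{PRFT}$, we need only show that
$\mathrm{ST} \cap \mathcal{U}_{\mathrm{com}} \ne \mathrm{RFT} \cap \mathcal{U}_{\mathrm{com}}$ and that $\mathrm{RFT} \cap \mathcal{U}_{\mathrm{com}} \ne \mathrm{PRFT} \cap \mathcal{U}_{\mathrm{com}}$. The trace in Figure 6.6
shows a $(C_1, \Phi_1) \notin \mathrm{ST}$, but $(C_1, \Phi_1) \in \mathrm{RFT}$. Figure 6.7 shows $(C_2, \Phi_2) \notin \mathrm{RFT}$, but $(C_2, \Phi_2) \in \mathrm{PRFT}$.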
Consider a trace $(C_1, \Phi_1) \in \mathcal{U}_{\mathrm{com}}$ that generates the computation dag shown in Figure 6.6. This
trace describes Schedule 3 from the code in Figure 7.1. We can show that $(C_1, \Phi_1) \notin \mathrm{ST}$, but that
$(C_1, \Phi_1) \in \mathrm{RFT}$.
Suppose for the purpose of contradiction that $(C_1, \Phi_1) \in \mathrm{ST}$. Any topological sort $S$ that
satisfies the first condition of Definition 13 must be consistent with all dependency edges. But then,
we must have $3 <_S 7 <_S 10 <_S 11 <_S 15 <_S 19$. Thus, if we consider $V(A)$ and pick $v = 10$
(or 11), we violate the second condition of Definition 13. Said differently, because transaction $J_1$
reads from $I_1$, and transaction $I_2$ reads from $J_1$, we cannot have $V(A)$ appear contiguous in any
topological sort $S$.
Picking $S = (1, 2, \ldots, 26)$ shows that the trace $(C_1, \Phi_1)$ is race free. We can verify this fact by
checking all transactions $T$ against the second condition of Definition 15. First, the nested transactions
$I_1$, $I_2$, $J_1$, and $J_2$ all appear contiguous in $S$, and thus there is no $v$ that appears between the
begin and end vertices to satisfy the third condition of Definition 14. For transaction $A$, all vertices
that appear between $\mathrm{begin}(A)$ and $\mathrm{end}(A)$ that do not belong to $V(A)$ belong to $V(B)$. But for
all $v \in V(B)$ and $w \in \mathrm{cContent}(A)$, vertices $v$ and $w$ access different memory locations. (Recall
that $I_1$ and $I_2$ are excluded from $\mathrm{cContent}(A)$ because they are open transactions.) Thus, we
cannot satisfy the second condition of Definition 14 for $A$. A similar argument rules out races with $B$.
To show that $\mathrm{RFT} \cap \mathcal{U}_{\mathrm{com}} \ne \mathrm{PRFT} \cap \mathcal{U}_{\mathrm{com}}$, consider the trace $(C_2, \Phi_2) \in \mathcal{U}_{\mathrm{com}}$ shown in
Figure 6.7, which represents an execution of the program from Figure 7.2. If all transactions
commit, then $(C_2, \Phi_2) \notin \mathrm{RFT}$. Any topological sort $S$ satisfying the first condition of race freedom
must have $13 <_S 15$, because 15 reads the value of $b$ written by 13. Looking at $A$, we can pick
$v = 13$ and $w = 15$. We see that $\mathrm{begin}(A) <_S v <_S \mathrm{end}(A)$, $w \in \mathrm{cContent}(A)$, and the first
and second conditions of Definition 14 are satisfied. Thus, we satisfy $\mathrm{RACE}_S(13, A)$, and hence
$(C_2, \Phi_2) \notin \mathrm{RFT}$.

Figure 6.6: When all transactions commit, this computation dag $G(C_1)$ with observer edges $\Phi_1$ is not
serializable, but is race free. This trace represents Schedule 3 from the program in Figure 7.1.
We can show that $(C_2, \Phi_2) \in \mathrm{PRFT}$, however. Consider $S = (1, 2, \ldots, 17)$. Since transactions
$B$ and $C$ appear contiguous in $S$, we only need to check $A$ for prefix races. The only vertices
$v \notin V(A)$ that come between $\mathrm{begin}(A)$ and $\mathrm{end}(A)$ are $v = 12$ and $v = 13$, neither of which
conflicts with any $w$ in the prefix of $A$ (vertices 1 through 6). □
Trade-offs among the models
The three transactional memory models of serializability, race freedom, and prefix-race freedom
exhibit different behaviors in TM systems that have open transactions.
With serializability, for any trace $(C, \Phi) \in \mathrm{ST}$, we can "change" the trace to convert any open
transaction $T'$ nested inside a committed transaction $T$ from open to closed while still keeping the
same $\Phi$, and still be serializable. Thus, in some sense, with serializability, open nesting only differs
from closed nesting if an open transaction commits, but its parent aborts.
Race freedom appears to be more difficult to implement than either serializability or prefix-race
freedom. For example, consider the example from Figures 7.2 and 6.7. After a transaction $I_1$
(open-nested in $A$) commits, any number of other transactions ($B$ and $C$) can read values written
by that open transaction and commit their changes, all before the original outer transaction $A$
completes. To support race freedom, it seems we may need to maintain the footprints of $B$ and $C$
even after they have committed to detect a future conflict with $A$.
Figure 6.7: When all transactions commit, this computation dag $G(C_2)$ with observer edges $\Phi_2$ is
prefix-race free, but not race free, because a race exists between vertices 13 and 15.
6.6 Related Work
When transactional memory research began, researchers assumed that serializability would be the
obvious memory model for TM. Recently, since we completed the work described in this chapter,
there has been work on formal models for transactional memory. Moore and Grossman [MG08]
focus on program semantics of transactional memory and how it might interact with other language
features. They introduce a framework to analyze the various features of software transactional
memory formally. Guerraoui and Kapalka [GK08] introduce opacity as the correctness criterion for
transactional memory.
Most of the debate on the semantics of transactional memory has recently concentrated on
strong versus weak isolation. Informally, strong isolation guarantees that transactional accesses to
memory are isolated from nontransactional accesses. In our work, we have concentrated almost
exclusively on strong isolation memory models. However, strong isolation may be difficult to provide
efficiently in software transactional memory since it prohibits many hardware and compiler
optimizations. Therefore, there has been much interest in understanding reasonable memory models
that provide weak isolation [MBS+08, SATG+09, DS09, SDMS08].
However, most of this work does not model the precise semantics of open-nested transactions.
Formal models for systems with nested transactions appear as early as the work by Beeri, Bernstein,
and Goodman [BBG89]. Most of this work is related to database transactions, however. Papers
providing operational semantics for open transactions include [MH05, MCC+06, Lib06]. Although
operational semantics of a TM can provide an abstract basis for implementation, inferring emergent
properties of the system from these semantics can be quite difficult. The work on opacity as a
correctness criterion does consider open nesting. That work treats an open-nested transaction as a
single abstract operation, however, and does not concern itself with specific memory operations
inside the open-nested transaction. Therefore, the semantics defined by that model is at a slightly
higher level than that presented in this thesis.
Chapter 7
Semantics of Open-Nested Transactions
This chapter explores open nesting in more detail. In Chapter 6, we saw the framework for describing
TM semantics. In this chapter, we use that framework to define the precise semantics of open-nested
transactions. We see examples of why open-nested transactions are not serializable, and the
strange behaviors that result from this fact. We prove that, in fact, open-nested transactions exhibit
prefix-race freedom.
The chapter is organized as follows: Section 7.1 describes some behaviors, both desirable and
undesirable, allowed by open nesting; Section 7.2 describes the operational semantics of open
nesting using our computation tree framework; Section 7.3 presents a proof that this operational
semantics guarantees prefix-race freedom; and finally, Section 7.4 discusses why an unconstrained
use of open-nested transactions is undesirable.
7.1 Subtleties of Open Nesting
This section motivates the need for a precise description of the memory semantics of open-nested
transactions using three examples to illustrate some subtleties with open nesting. The first example
shows that some desirable schedules allowed by open nesting are not serializable. The second
example shows that the loss of serializability for open nesting sanctions arguably bizarre program
behaviors. The third example shows that open nesting compromises composability.
Figure 7.1 describes a program with nested transactions where the use of open nesting admits
a desirable schedule which is not serializable. Moreover, a system with only flat or closed nesting
prohibits the schedule. In Figure 7.1, transaction A reads from global variable a, adds a key-value
pair based on a to a global table, reads from b and adds a corresponding pair to the table, and then
stores the sum a + b into c. Transaction B performs analogous operations on d, e, and f. The table
data structure is implemented as a simple direct-access table [CLRS01, Section 11.1] with a global
size field to count the number of elements in the table.
If the nested transactions (the I's and J's) are all flat-nested or closed-nested, then TM guarantees
that the transactions are serializable: the program appears to execute as though either A
happened before B (Schedule 1) or B happened before A. The system might actually perform the
operations in a different, interleaved order (for example, Schedule 2), but this schedule is equivalent
to one of the two valid serial schedules (in this case, Schedule 1). Schedule 3 is not serializable,
however, because J1 (and thus B) observes the intermediate value of A.size written by I1 (and
thus written by A). Consequently, Schedule 3 is prohibited with flat or closed nesting.

This work was done in collaboration with Charles E. Leiserson and Jim Sukha [ALS06].

     1  xbegin                         13  xbegin
     2    read a             } A1      14    read d             } B1
     3    xbegin                       15    xbegin
     4      Insert(a, a)     } I1      16      Insert(d, d)     } J1
     5    xend                         17    xend
     6    read b             } A2      18    read e             } B2
     7    xbegin                       19    xbegin
     8      Insert(b, b)     } I2      20      Insert(e, e)     } J2
     9    xend                         21    xend
    10    c ← a + b          } A3      22    f ← d + e          } B3
    11    write c                      23    write f
    12  xend                           24  xend

    25  Insert(x, y) {
    26    A.array[x] ← y
    27    read A.size
    28    A.size ← A.size + 1
    29    write A.size
    30  }

    Sample orderings:
      Schedule 1: A1, I1, A2, I2, A3, B1, J1, B2, J2, B3
      Schedule 2: [interleaved ordering; listing not recoverable from the source]
      Schedule 3: [interleaved ordering; listing not recoverable from the source]

Figure 7.1: Two concurrent transactions that do not share any memory locations except in their
nested transactions. Divide transaction A into abstract operations A1, I1, A2, I2, A3, and divide B into
B1, J1, B2, J2, B3. The I's and J's represent inserts to an abstract table data structure. Schedule 1 is a
serial order, Schedule 2 is an interleaved order equivalent to Schedule 1, and Schedule 3 is an interleaved
order which is not serializable.
To improve concurrency, a programmer may wish to allow certain schedules that are not
serializable, but which nevertheless are consistent from the programmer's point of view. A system that
can admit nonserializable schedules imposes fewer restrictions on transactions, possibly allowing
transactions to commit when they would have otherwise aborted. For example, the programmer
may wish to admit Schedule 3, even though the I's and J's happen to access the same size field.
Conceptually, the programmer may not care in which order the table inserts occur. If
I1, I2, J1, and J2 are open transactions, then Schedule 3 is a valid execution.
Once a TM system with open nesting admits some desirable nonserializable schedules, how-
ever, the proverbial cat is out of the bag. As far as the memory semantics are concerned, it seems
difficult to prohibit additional program behaviors that might arguably be undesirable. Figure 7.2
shows a program execution allowed by the open-nesting implementations of [MH05, MCC+06]. In
this example, it is possible for all transactions A, I1, B, and C to commit, even though A does not
appear to execute atomically. Transaction A reads inconsistent data, since C writes to b between
A's reads of a and b. Thus, the "snapshot" of the world seen by A when it begins is different from
its snapshot part way through its computation.
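The inconsistency in Figure 7.2 can be made concrete with a small simulation. The following Python sketch is only an illustration under stated assumptions: it executes the steps of the figure sequentially in a chosen order rather than modeling a TM system, all variables start at 0, and the function names `a1`, `a2`, `b`, and `c` are hypothetical labels for the parts of transactions A, B, and C.

```python
from itertools import permutations


def run(steps):
    """Execute steps in order against a fresh memory (all variables assumed 0)."""
    mem = {"a": 0, "b": 0, "c": 0, "i": 0}
    local = {}  # values that transaction A has already read
    for step in steps:
        step(mem, local)
    return mem


def a1(mem, local):
    # First part of A: read a, then the open-nested increment of i.
    local["a"] = mem["a"]
    mem["i"] = mem["i"] + 1


def a2(mem, local):
    # Second part of A: read b, then write c = a + b.
    mem["c"] = local["a"] + mem["b"]


def b(mem, local):
    mem["i"] = mem["i"] + 1   # transaction B


def c(mem, local):
    mem["b"] = mem["i"]       # transaction C


# The interleaving of Figure 7.2: B and C run between the two parts of A.
interleaved_c = run([a1, b, c, a2])["c"]

# Every serial order of the three transactions, with A kept contiguous.
serial_c = set()
for order in permutations([[a1, a2], [b], [c]]):
    serial_c.add(run([s for block in order for s in block])["c"])
```

Under these assumptions the interleaved execution leaves a value in c that no serial order of A, B, and C can produce, which is one way to see that A does not appear atomic.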
Our final example illustrates how open nesting can admit subtle program behaviors that affect
the composability of transactions. Consider the program in Figure 7.3, which describes an
implementation of a simple table library that (arguably) contains a subtle flaw. The program includes a
Contains(x) method to complement the Insert(x, y) method used in Figure 7.1. Since the
size field is the primary source of transaction conflicts between table operations, the Contains
method "optimizes" its search method by checking size within an open transaction.
Using TM with open nesting, in any sequence of Contains or Insert operations, each
individual operation still appears atomic. Thus, in transaction A in Figure 7.3, we might expect
that if the Contains operation returns false, then the key can be safely inserted into the hash
table without adding duplicates.

     1  xbegin                          8  xbegin
     2    read a            } A1        9    read i
     3    xbegin_open                  10    i ← i + 1          B
     4      read i                     11    write i
     5      i ← i + 1       } I1       12  xend
     6      write i
     7    xend_open                    13  xbegin
                                       14    read i
    18    read b                       15    b ← i              C
    19    c ← a + b         } A2       16    write b
    20    write c                      17  xend
    21  xend

Figure 7.2: A program execution permitted by open nesting. Transaction A does not appear to execute
atomically, because it can read an "inconsistent" value for b if B and C interleave between the execution of
A1 and A2.

       1  xbegin                             3  xbegin
    A  2    if (!Contains(5))             B  4    if (!Contains(5))
       7      Insert(5, 15)                  5      Insert(5, 10)
       8  xend                               6  xend

       Contains(x)                           Insert(x, y)
       9  xbegin                            19  xbegin
      10    found ← true                    20    A.array[x] ← y
      11    xbegin_open                     21    read A.size
      12      read A.size                   22    A.size ← A.size + 1
      13      if (A.size = 0) empty ← true  23    write A.size
      14    xend_open                       24  xend
      15    if (!empty)
      16      found ← A.array[x]
      17    return ((!empty) & found)
      18  xend

Figure 7.3: Flawed implementation of a table data structure with two methods, Contains(x) and
Insert(x, y). Although each method individually appears atomic, transactions A and B, which call
those methods, may not appear atomic. In particular, the ordering <1, 2, 3, 4, 5, 6, 7, 8> is allowed.
Unfortunately, one cannot correctly call both Contains and Insert inside a transaction
T and still have T appear to be atomic. Indeed, the open-nesting implementation described in
[MCC+06] allows the entire transaction B to execute between Lines 2 and 7 of transaction A.
Thus, this code shows that composability of transactions is not preserved. When using open nesting,
simply ensuring the atomicity of individual transactions is not sufficient to guarantee composability.
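The effect of the ordering <1, 2, ..., 8> from Figure 7.3 can be replayed with a few lines of Python. This is a sketch, not a TM implementation: the steps are simply executed in the order the figure allows, the table is assumed to start empty, and the `Table` class with a dictionary-backed array is a hypothetical stand-in for the figure's table (membership in the dictionary replaces the lookup A.array[x]).

```python
class Table:
    """Hypothetical stand-in for the table of Figure 7.3."""

    def __init__(self):
        self.array = {}
        self.size = 0


def contains(table, key):
    # Mirrors Contains(x): the size check is the part the figure runs
    # inside an open transaction.
    if table.size == 0:
        return False
    return key in table.array


def insert(table, key, value):
    # Mirrors Insert(x, y): store the value and bump the element count.
    table.array[key] = value
    table.size += 1


table = Table()

a_missing = not contains(table, 5)   # line 2, transaction A
b_missing = not contains(table, 5)   # line 4, transaction B
if b_missing:
    insert(table, 5, 10)             # line 5; B then commits at line 6
if a_missing:
    insert(table, 5, 15)             # line 7; A then commits at line 8
```

After this replay the table holds a single key but its size field reports two elements, and B's value has been overwritten, even though each call to `contains` and `insert` ran without interruption.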
Admittedly, the examples in Figures 7.2 and 7.3 are somewhat contrived. In particular, unlike
in Figure 7.1, transactions in Figures 7.2 and 7.3 cannot be partitioned into clear abstraction levels,
with each level accessing disjoint memory locations, as Moss suggests may be necessary [Mos06].
These examples suggest, however, that for open nesting, the distinction between the abstract pro-
gram model and the low-level memory model is much more significant than for closed or flat
nesting. Thus, these examples motivate the need to understand memory models for open nesting
so that at the very least we can understand what properties should be enforced by higher-level
mechanisms.
7.2 The Operational Model
This section presents an abstract operational model for open nesting, called the ON model. This
model is a generalization of most open nesting implementations, specifically the Stanford model [MCC+06].
We prove that the ON model implements at least prefix-race freedom but is strictly weaker than
race freedom.
We begin our description of the ON model by defining some notation. For any set $S \subseteq \mathrm{nodes}(C)$
of tree nodes, let $\mathrm{lowest}(S)$ be the node $X \in S$ such that $S \subseteq \mathrm{ances}(X)$, if such
an $X$ exists. Otherwise, define $\mathrm{lowest}(S) = \mathrm{null}$. Thus, if all nodes in $S$ fall on one root-to-leaf
path in $C$, then $\mathrm{lowest}(S)$ is the lowest node on that path. Define $\mathrm{highest}(S)$ in a similar
fashion. For any $T \in \mathrm{xactions}(C)$, define $\mathrm{xparent}[T] = \mathrm{lowest}(\mathrm{ances}(T) \cap \mathrm{xactions}(C))$,
that is, $\mathrm{xparent}[T]$ is the transactional parent of $T$. For any $X \in \mathrm{nodes}(C)$, let $\mathrm{xAnces}(X) =
\mathrm{ances}(X) \cap \mathrm{xactions}(C)$ be the set of transactional ancestors of $X$.
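The notation above can be exercised on a small tree. The following Python sketch is illustrative only and makes two assumptions that are not stated in this section: `ances(x)` is taken to include `x` itself (so that a node can satisfy $S \subseteq \mathrm{ances}(X)$ while belonging to $S$), and the transactional parent is computed over proper ancestors so that a transaction is not its own parent. The `Node` class and the example tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(eq=False)  # identity-based hashing, so nodes can be stored in sets
class Node:
    name: str
    parent: Optional["Node"] = None
    is_xaction: bool = False


def ances(x: Optional[Node]) -> Set[Node]:
    """x together with all of its ancestors (assumed inclusive)."""
    out = set()
    while x is not None:
        out.add(x)
        x = x.parent
    return out


def lowest(nodes: Set[Node]) -> Optional[Node]:
    """The node of the set that has every node of the set as an ancestor, else None."""
    for x in nodes:
        if nodes <= ances(x):
            return x
    return None


def xparent(t: Node) -> Optional[Node]:
    """Closest transactional proper ancestor of t (assumption: t itself is excluded)."""
    proper = ances(t) - {t}
    return lowest({x for x in proper if x.is_xaction})


# Hypothetical tree: root -> T1 -> S -> T2, and root -> T3.
root = Node("root", None, True)
t1 = Node("T1", root, True)
s = Node("S", t1)
t2 = Node("T2", s, True)
t3 = Node("T3", root, True)
```

Here `lowest({root, t1, t2})` is `t2` because the three nodes lie on one root-to-leaf path, whereas `lowest({t2, t3})` is `None` because T2 and T3 sit on different paths.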
Abstractly, we shall view the ON model for open nesting as a nondeterministic state machine
ON that constructs a sequence of traces. The initial trace contains a computation tree consisting
of a single S-node $\mathrm{root}(C) \in \mathrm{spNodes}(C)$ with associated sets $\mathrm{xactions}(C) = \{\mathrm{root}(C)\}$
and $\mathrm{open}(C) = \mathrm{committed}(C) = \mathrm{aborted}(C) = \emptyset$, and an empty observer function $\Phi$. By
assuming that $\mathrm{root}(C) \in \mathrm{xactions}(C)$, we simplify the description of the model by treating the
entire computation $C$ as a global closed transaction in which other transactions are nested. The
computation also maintains an initially empty auxiliary set $\mathrm{done}(C) \subseteq \mathrm{nodes}(C)$ of nodes that
have finished their execution. The computation tree $C$ and all these associated sets only grow
during the execution.
At any time during the computation, a subset $\mathrm{ready}(C)$ of S-nodes are designated as ready,
meaning that they can issue a program instruction: read, write, fork, join,
xbegin, xbegin_open, or xend. The ON machine nondeterministically chooses a ready S-node
to issue an instruction, and the machine processes the instruction, which augments $(C, \Phi)$ by
adding nodes to the tree and to its associated sets. Unlike the other associated sets, $\mathrm{ready}(C)$ may
grow and shrink during execution.
We shall factor the description of the state machine ON by describing the creation of the
computation tree $C$ and the observer function $\Phi$ separately.
Creating the computation tree
How the computation tree $C$ evolves depends on the instructions that are issued nondeterministically.
Let $X$ be the S-node that issues an instruction. The instructions are handled as follows:
* read from a location $\ell \in M$: If the read causes a conflict (more about conflicts when
we describe the creation of the observer function) with one or more transactions, abort¹ the
deepest such transaction $T$ by adding all transactions $T' \in \mathrm{desc}(T) \cap \mathrm{xactions}(C) -
\mathrm{done}(C)$ both to $\mathrm{aborted}(C)$ and to $\mathrm{done}(C)$. Keep checking for and aborting conflicting
transactions $T$, deepest to shallowest, until no such conflicting transactions exist. Then,
create a new read node $v \in \mathrm{memOps}(C)$ as the last child of the S-node $X$. Add $v$ to
$\mathrm{done}(C)$.
* write to a location $\ell \in M$: Similar to read.
* fork: Create a new P-node $Y \in \mathrm{nodes}(C)$ as a child of $X$, and create two new S-nodes as
children of $Y$. Add these two children to $\mathrm{ready}(C)$, and remove $X$ from $\mathrm{ready}(C)$.
* join: Test whether $X$'s sibling belongs to $\mathrm{done}(C)$. If yes, then add $X$ and then $\mathrm{parent}[X]$
to $\mathrm{done}(C)$. Remove $X$ from $\mathrm{ready}(C)$, and add $\mathrm{parent}[\mathrm{parent}[X]]$ (the grandparent of
$X$, which is an S-node) to $\mathrm{ready}(C)$. If no, then remove $X$ from $\mathrm{ready}(C)$, and add $X$
to $\mathrm{done}(C)$.
* xbegin: Create a new S-node $Y \in \mathrm{nodes}(C)$ as the last child of $X$. Add $Y$ to $\mathrm{xactions}(C)$.
Remove $X$ from $\mathrm{ready}(C)$, and add $Y$ to $\mathrm{ready}(C)$.
* xbegin_open: Similar to xbegin, but also add $Y$ to $\mathrm{open}(C)$.
* xend: Test whether $X \in \mathrm{xactions}(C)$. If yes, remove $X$ from $\mathrm{ready}(C)$, and add
$\mathrm{parent}[X]$ to $\mathrm{ready}(C)$. Add $X$ to $\mathrm{done}(C)$ and to $\mathrm{committed}(C)$. If no, error.
The ON machine maintains several invariants. All transactions are S-nodes. Every P-node has an
S-node as its parent and has exactly two S-nodes as children. If an S-node is ready, none of its
ancestors are ready.

¹ The ON machine uses a "pessimistic" concurrency control mechanism in that it immediately aborts a
conflicting transaction $T$ upon conflict. Moreover, it always aborts $T$ rather than its own transaction. One could
abort the transaction performing the read, but the model is simpler by always aborting $T$ and not providing a
nondeterministic choice.
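The fork and join rules above can be sketched as a tiny state machine. The following Python code is a minimal illustration covering only those two instructions; the `Node` and `Machine` classes are hypothetical, and memory operations, transactions, and aborts are deliberately left out.

```python
class Node:
    def __init__(self, kind, parent=None):
        self.kind = kind            # "S" or "P"
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


class Machine:
    """Tracks only ready(C) and done(C) for fork and join."""

    def __init__(self):
        self.root = Node("S")
        self.ready = {self.root}
        self.done = set()

    def fork(self, x):
        # New P-node under x with two S-node children; x stops being ready.
        p = Node("P", x)
        left, right = Node("S", p), Node("S", p)
        self.ready.remove(x)
        self.ready.update((left, right))
        return left, right

    def join(self, x):
        p = x.parent
        sibling = next(c for c in p.children if c is not x)
        self.ready.remove(x)
        if sibling in self.done:
            # Second branch to arrive: finish the P-node, wake the grandparent.
            self.done.update((x, p))
            self.ready.add(p.parent)
        else:
            # First branch to arrive: just mark it done.
            self.done.add(x)


m = Machine()
left, right = m.fork(m.root)
m.join(left)
ready_after_first_join = set(m.ready)
m.join(right)
```

After the first join only the other branch is ready; after the second join the P-node is done and the original S-node becomes ready again, which is consistent with the invariant that a ready S-node has no ready ancestors.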
Creating the observer function
To create the observer function, the ON model maintains auxiliary state to keep track of how
values are propagated among transactions and global memory. Specifically, every transaction $T \in
\mathrm{xactions}(C)$ maintains a readset $R(T)$ and a writeset $W(T)$. The readset $R(T)$ is a set of pairs $(\ell, v)$,
where $\ell \in M$ is a memory location and $v \in \mathrm{memOps}(C)$ is the memory operation that read from $\ell$,
that is, we maintain the invariant $R(v, \ell)$ for all $(\ell, v) \in \bigcup_{T \in \mathrm{xactions}(C)} R(T)$. The writeset $W(T)$ is
similarly defined. We initialize $R(\mathrm{root}(C)) = W(\mathrm{root}(C)) = \{(\ell, \mathrm{begin}(\mathrm{root}(C))) : \ell \in M\}$.
The ON model maintains two invariants concerning readsets and writesets. First, it maintains
$W(T) \subseteq R(T)$ for every transaction $T \in \mathrm{xactions}(C)$, that is, a write to a location also counts
as a read to that location. Second, $R(T)$ and $W(T)$ each contain at most one pair $(\ell, v)$ for any
location $\ell$. Because of this second invariant, we employ the shorthand $\ell \in R(T)$ to mean that
there exists a node $u$ such that $(\ell, u) \in R(T)$, and similarly for $W(T)$. We also overload the union
operator to accommodate this assumption: if we write $R(T) \leftarrow R(T) \cup \{(\ell, u)\}$, then if there exists
$(\ell, u') \in R(T)$, we mean to replace it with $(\ell, u)$. Likewise, if $u$ accesses a location $\ell$, we employ
the shorthand $u \in R(T)$ to mean that $(\ell, u) \in R(T)$, and similarly for $W(T)$.
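One convenient way to picture the "at most one pair per location" invariant and the overloaded union is to store each readset or writeset as a dictionary keyed by location. The Python sketch below assumes that representation; the helper names `add_pair` and `is_subset` and the operation labels are hypothetical.

```python
def add_pair(access_set, loc, op):
    """R(T) <- R(T) ∪ {(loc, op)}: any existing pair for loc is replaced."""
    access_set[loc] = op


def is_subset(w, r):
    """Check W(T) ⊆ R(T) when both are sets of (location, operation) pairs."""
    return all(loc in r and r[loc] == op for loc, op in w.items())


R = {}
W = {}

add_pair(R, "x", "u1")      # a read of x observing operation u1
add_pair(R, "x", "u2")      # a later pair for x replaces the earlier one
add_pair(R, "y", "u3")
add_pair(W, "x", "u2")      # a write also counts as a read, so (x, u2) is in both sets

has_x = "x" in R            # the shorthand "x ∈ R(T)"
```

With this representation the shorthand $\ell \in R(T)$ is ordinary key membership, and replacement on union is ordinary assignment.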
The state machine ON handles events as follows, where $X$ is the S-node that issues the
instruction:
* read from location $\ell \in M$: If there exists a $T \in \mathrm{xactions}(C) - \mathrm{done}(C) - \mathrm{ances}(X)$
such that $\ell \in W(T)$, then a conflict occurs. Let $v$ be the read operation added as the last child
of $X$. Define $S = \{T \in \mathrm{xactions}(C) \cap \mathrm{ances}(v) : \ell \in R(T)\}$, let $T' = \mathrm{lowest}(S)$, and
let $(\ell, u) \in R(T')$. Add $(\ell, u)$ to $R(T)$, and set $\Phi(v) = u$.
* write to a location $\ell \in M$: Similar to read, but to check for a conflict, test whether there
exists a $T \in \mathrm{xactions}(C) - \mathrm{done}(C) - \mathrm{ances}(X)$ such that $\ell \in R(T)$. Find $u$ in the same
way, and add $(\ell, u)$ both to $R(T)$ and to $W(T)$, and set $\Phi(v) = u$.
* xbegin and xbegin_open: Initialize $R(Y) = \emptyset$ and $W(Y) = \emptyset$.
* xend: If $X \in \mathrm{closed}(C)$, then add $R(X)$ to $R(\mathrm{xparent}[X])$ and add $W(X)$ to $W(\mathrm{xparent}[X])$.
If $X \in \mathrm{open}(C)$, then let $Q = \mathrm{xAnces}(X)$. For any $(\ell, u) \in W(X)$, let $\alpha = \{T' \in Q : \ell \in R(T')\}$.
For all such $T' \in \alpha$, $R(T') \leftarrow R(T') \cup \{(\ell, u)\}$. Similarly, let $\beta = \{T' \in Q : \ell \in W(T')\}$.
For all $T' \in \beta$, $W(T') \leftarrow W(T') \cup \{(\ell, u)\}$.
* fork or join: No action.
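The difference between committing a closed and an open transaction can be seen in a small Python sketch of the rules above. It rests on several simplifying assumptions: only transactions are modeled (no P-nodes or conflict detection), readsets and writesets are dictionaries from locations to operation labels, a write records the writing operation itself in both sets, and the open-commit rule is applied to the proper transactional ancestors. The `Xaction` class, the labels `"init"` and `"w1"`, and the `scenario` helper are hypothetical.

```python
class Xaction:
    def __init__(self, name, parent=None, is_open=False):
        self.name, self.parent, self.is_open = name, parent, is_open
        self.R, self.W = {}, {}


def xances(t):
    """t and its transactional ancestors, lowest first."""
    chain = []
    while t is not None:
        chain.append(t)
        t = t.parent
    return chain


def read(t, loc):
    # Observe the pair held by the lowest ancestor whose readset mentions loc.
    holder = next(a for a in xances(t) if loc in a.R)
    t.R[loc] = holder.R[loc]
    return t.R[loc]


def write(t, loc, op):
    # Assumption: the writing operation itself is recorded in both sets.
    t.R[loc] = op
    t.W[loc] = op


def xend(x):
    if not x.is_open:
        # Closed commit: merge both sets into the transactional parent.
        x.parent.R.update(x.R)
        x.parent.W.update(x.W)
        return
    # Open commit: update only ancestors that already mention the location.
    for loc, op in x.W.items():
        for a in xances(x.parent):
            if loc in a.R:
                a.R[loc] = op
            if loc in a.W:
                a.W[loc] = op


def scenario(is_open):
    root = Xaction("root")
    root.R = {"size": "init"}
    root.W = {"size": "init"}
    a = Xaction("A", root)
    inner = Xaction("I1", a, is_open)
    seen = read(inner, "size")
    write(inner, "size", "w1")
    xend(inner)
    return seen, a, root


seen_open, a_open, root_open = scenario(True)
seen_closed, a_closed, root_closed = scenario(False)
```

In the open case the write is published to the root while A's own sets stay untouched, so a later conflict check against A does not see the location; in the closed case the write is absorbed into A and the root is unchanged until A itself commits.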
The Stanford model [MCC+06] is similar to the ON model, except that it only supports "lin-
ear" nesting (transactions can have no parallel transactions within them) and the choice of which
transaction to abort is nondeterministic. Neither of these differences affects the theorems that deal
with the ON model, assuming they implement their system with pessimistic concurrency control.
7.3 Prefix Race-Freedom of the Operational Model
We now prove that the ON model is prefix-race free with respect to the natural topological sort $S$
of $G(C)$ created by the nondeterministic operation of the ON machine. Specifically, as the ON
model generates a trace $(C, \Phi)$, it creates tree nodes $\mathrm{nodes}(C) = \mathrm{spNodes}(C) \cup \mathrm{memOps}(C)$ and
eventually marks these nodes as "done" by placing them in $\mathrm{done}(C)$. We can view this process as
determining the topological sort $S$ of $G(C)$ as follows. When a node $X \in \mathrm{nodes}(C)$ is created, the
vertex $\mathrm{begin}(X) \in V(C)$ is appended to $S$. When a node is marked as done, the vertex $\mathrm{end}(X) \in
V(C)$ is appended to $S$. If the node $X$ is a memory operation, we have $\mathrm{begin}(X) = \mathrm{end}(X) = X$,
and we view it as being appended only once. It is straightforward to verify that $S$ is indeed a
topological sort of $G(C)$, and indeed of $D_G(C, \Phi)$.
We begin with a definition of time in the ON model. If $v \in V(C)$ is the $t$th element of $S$, we
say that $v$ occurs at time $t$, and we write $t = S(v)$. Thus, for all $u, v \in V(C)$, we have $u <_S v$ if
and only if $S(u) < S(v)$. We can view the evolution of $(C, \Phi)$ over time as a sequence $(C^{(t)}, \Phi^{(t)})$
for $t = 0, 1, \ldots$, where the operation that occurs at time $t$ creates $(C^{(t)}, \Phi^{(t)})$ from $(C^{(t-1)}, \Phi^{(t-1)})$.
For convenience, however, we shall omit time indices unless clarity demands it.
We define two time-sensitive sets. The set of active transactions at any given time is
$\mathrm{activeX}(C) = \mathrm{xactions}(C) - \mathrm{done}(C)$. The spine of a memory location $\ell \in M$ at any given time is
$\mathrm{spine}(\ell) = \{T \in \mathrm{activeX}(C) : \ell \in W(T)\}$.
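The two sets can be computed directly from their definitions. The Python sketch below assumes a minimal transaction record with a `done` flag and a dictionary-valued writeset; the `Xaction` class and the example transactions are hypothetical.

```python
class Xaction:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.W = {}          # writeset: location -> writing operation
        self.done = False


def active_x(xactions):
    """activeX(C) = xactions(C) - done(C)."""
    return [t for t in xactions if not t.done]


def spine(xactions, loc):
    """spine(loc): active transactions whose writeset mentions loc."""
    return [t for t in active_x(xactions) if loc in t.W]


# Hypothetical state: the root holds every location; A is active and has
# written x; B also wrote x but has already finished.
root = Xaction("root")
root.W = {"x": "init", "y": "init"}
a = Xaction("A", root)
a.W = {"x": "w1"}
b = Xaction("B", root)
b.W = {"x": "w0"}
b.done = True

everything = [root, a, b]
spine_x = [t.name for t in spine(everything, "x")]
spine_y = [t.name for t in spine(everything, "y")]
```

In this state the spine of x consists of the root and A, which lie on a single root-to-leaf path, while the finished transaction B no longer contributes.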
We now state a structural lemma that describes invariants of the computation tree C as it
evolves.
Lemma 7.1 The ON machine maintains the following invariants:
1. If $T \in \mathrm{activeX}(C)$, then we have $\mathrm{xAnces}(T) \subseteq \mathrm{activeX}(C)$.
2. If $v \in W(T)$, then $v \in V(T)$.
3. All transactions in $\mathrm{spine}(\ell)$ are on the same root-to-leaf path in $C$, and hence the node
$\mathrm{lowest}(\mathrm{spine}(\ell))$ exists.
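4. If $\ell \in R(T)$, where $T \in \mathrm{activeX}(C)$, then we have either
$\mathrm{spine}(\ell) \subseteq \mathrm{ances}(T)$ or $T \in \mathrm{ances}(\mathrm{lowest}(\mathrm{spine}(\ell)))$.
5. If $(\ell, u) \in R(T)$ for some $T \in \mathrm{activeX}(C)$, then $(\ell, u) \in W(T')$, where
$T' = \mathrm{lowest}(\mathrm{xAnces}(T) \cap \mathrm{spine}(\ell))$.
6. Let $(\ell, u) \in W(T_1)$ and $(\ell, v) \in W(T_2)$, where $T_1, T_2 \in \mathrm{spine}(\ell)$. If
$T_1 \in \mathrm{ances}(T_2)$, then $u <_S v$.
7. Let $(\ell, u) \in W(T)$ and let $u <_S v$ such that $W(v, \ell)$. Then, we have $v \in \mathrm{desc}(T)$.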
PROOF. By induction.
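For the base case, we start with $\mathrm{xactions}(C) = \{\mathrm{root}(C)\}$ and $\mathrm{done}(C) = \emptyset$. Our initial
readset $R(\mathrm{root}(C))$ and writeset $W(\mathrm{root}(C))$ contain all pairs $(\ell, \mathrm{begin}(\mathrm{root}(C)))$ for all $\ell \in \mathcal{M}$.
Thus, $\mathrm{spine}(\ell) = \{\mathrm{root}(C)\}$ for all locations $\ell \in \mathcal{M}$, and all invariants are satisfied.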
For the inductive step, assume all invariants are true at time t - 1.
The fork, join, xbegin, and xbeginopen events do not involve any memory operations or
readset/writeset manipulations. Thus, for these events, we only need to argue that Invariant 1 is
preserved at time $t$.
To check the invariants after a memory operation to a location $\ell$, we first label some transactions
on $\mathrm{spine}(\ell)$ for time $t - 1$. Let $T_1, T_2, \ldots, T_k$ be the $k$ transactions along $\mathrm{spine}(\ell)$. By Invariant 3,
for all $i$, $T_i \in \mathrm{ances}(T_{i+1})$ (all the $T_i$'s are along the same root-to-leaf path), and $\ell \in W(T_i)$. Let $x_i$
be the memory operation such that $(\ell, x_i) \in W(T_i)$. By Invariant 4 and Invariant 5, any transaction $T'$
for which $\ell \in R(T')$ satisfies $(\ell, x_j) \in R(T')$ for some $j$, that is, $T'$ has the value $x_j$ from its closest
transactional ancestor $T_j$ on the spine. By Invariant 6, we know $x_i <_S x_j$ for all $1 \le i < j \le k$,
and by Invariant 7, we know for any $y \in V(C)$ that satisfies $W(y, \ell)$ and $x_i <_S y$, $y \in \mathrm{desc}(T_i)$.
Consider a read operation to $\ell$, issued by an S-node $X$, that generates a conflict. Then there
exists a $T \in \mathrm{xactions}(C) - \mathrm{done}(C) - \mathrm{ances}(X)$ such that $\ell \in W(T)$. By Invariant 3,
$T \in \mathrm{spine}(\ell)$, and suppose $T = T_i$. The ON model aborts $T_i$ and all other transactions $T^* \in \mathrm{desc}(T_i)$,
thereby truncating $\mathrm{spine}(\ell)$ to be $\mathrm{spine}(\ell) \cap \mathrm{ances}(T_{i-1})$ (we never abort $\mathrm{root}(C) = T_1$). In
this case, Invariant 3 through Invariant 5 are all maintained. Invariant 6 and Invariant 7 are also
maintained because we have only removed elements from the spine.
Conflicts for a write operation are similar; if we detect a conflict with $T$ such that $\ell \in W(T)$,
then we have the same behavior as before. If $\ell \in R(T)$ but $\neg(\ell \in W(T))$, then we abort $T$, which
only reads $\ell$, leaving the spine intact.
After checking for conflicts and aborting the appropriate transactions, $X$ then adds a memory
operation $v$ to the computation tree. Let $T^* = \mathrm{lowest}(\{Y : Y \in \mathrm{xAnces}(X) \text{ and } \ell \in W(Y)\})$, i.e.,
the lowest transactional ancestor of $X$ which has $\ell$ in its writeset. We know that v gets its
Let $S_\ell = \{T \in \mathrm{xAnces}(v) : \ell \in R(T)\}$. By the ON model, we know $v$ reads the value of $u$
from $T^* = \mathrm{lowest}(S_\ell)$. Let $T_j = \mathrm{lowest}(\{Y : Y \in \mathrm{xAnces}(T^*) \text{ and } \ell \in W(Y)\})$ be the lowest
transaction on $\mathrm{spine}(\ell)$ that is an ancestor of $T^*$. By Invariant 5, we know that $T^*$ and $T_j$ must
have the same pair $(\ell, u)$ in their readset/writeset, respectively. We know $T_j \in \mathrm{spine}(\ell)$, and if $T_j$
is not the last transaction $T_k$ on the spine, then $T^*$ must still conflict with $T_k$. Thus, when we add
operation $v$, we are still reading $(\ell, u)$ from the end of the spine, thereby maintaining Invariant 4
and Invariant 5. Since $x_j$ already happened, we have $x_j <_S v$, and Invariant 6 is satisfied. We have
$X \in \mathrm{desc}(T'')$, maintaining Invariant 7.
If the memory operation $v$ to be added is a write, then by similar logic as for a read, $v$ is added
onto the end of $\mathrm{spine}(\ell)$.
On an xend operation for a closed transaction $T$, consider two cases. If $\ell \in R(T)$, but $\ell \notin W(T)$,
then committing $(u, \ell)$ to $\mathrm{xparent}[T]$ does not alter the above invariants because $u$ must be the
same as the value $x_k$ from $T_k$, the last transaction on $\mathrm{spine}(\ell)$; $\mathrm{spine}(\ell)$ itself remains unchanged.
If $\ell \in W(T)$, then we must be committing $T_k$ to some transaction $T^*$ between $T_{k-1}$ and $T_k$ on the
spine. If $T^* \ne T_{k-1}$, then $T^*$ becomes the new last transaction on the spine, with pair $(\ell, x_k)$. If
$T^* = T_{k-1}$, then the value $(\ell, x_{k-1})$ disappears from the spine.
The commit operation for an open transaction replaces all the values $(\ell, x_i)$ along the spine.
Then, since $(\ell, x_k)$ goes all the way up the spine, the last two invariants are still preserved.
$\square$
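Invariant 3 of Lemma 7.1 says that the spine lies on one root-to-leaf path, which is what makes lowest(spine(ℓ)) well defined. The following sketch checks that property for a set of tree nodes. The parent-map representation of the computation tree and the helper names ances, on_one_path, and lowest are assumptions for illustration; ances here includes the node itself.

```python
def ances(parent, node):
    """Ancestors of node in the tree, including node itself."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def on_one_path(nodes, parent):
    """True if all nodes lie on a single root-to-leaf path."""
    # Sort by depth; on a single path, each node is an ancestor of the next.
    ordered = sorted(nodes, key=lambda n: len(ances(parent, n)))
    return all(a in ances(parent, b) for a, b in zip(ordered, ordered[1:]))


def lowest(nodes, parent):
    """The deepest node of a nonempty set that lies on one path."""
    if not nodes or not on_one_path(nodes, parent):
        raise ValueError("lowest() is undefined for this set")
    return max(nodes, key=lambda n: len(ances(parent, n)))


# Hypothetical tree: root -> T1 -> T2, and root -> T3.
parent = {"T1": "root", "T2": "T1", "T3": "root"}

chain_ok = on_one_path({"root", "T1", "T2"}, parent)
deepest = lowest({"root", "T1", "T2"}, parent)
siblings_ok = on_one_path({"T2", "T3"}, parent)
```

A set such as {T2, T3}, whose members sit in different subtrees, fails the check; this is exactly the situation that conflict detection in the ON model rules out for writers of the same location.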
The next three lemmas describe additional structure of the computation tree.
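Lemma 7.2 For all $T \in \mathrm{aborted}(C)$ and $T' \in \mathrm{activeX}(C)$, if $v \in \mathrm{cContent}(T)$, then we have
$v \notin W(T')$. $\square$
Lemma 7.3 If $v \in \mathrm{memOps}(C)$ accesses $\ell \in \mathcal{M}$, then at time $S(v)$, we have $\mathrm{spine}(\ell) \subseteq \mathrm{ances}(v)$.
$\square$
Lemma 7.4 For all $v \in V(C)$, $T \in \mathrm{aborted}(C)$, and $w \in \mathrm{cContent}(T)$, if $\mathrm{end}(T) <_S v$, then
we have $wHv$. $\square$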
The next lemma shows that a memory location written within a transaction remains in the
writeset of some active descendant of the transaction.
Lemma 7.5 Let $w \in \mathrm{memOps}(C) \cap \mathrm{cContent}(T)$ be a memory operation in a transaction
$T \in \mathrm{xactions}(C)$, and suppose that $W(w, \ell)$ for some location $\ell \in \mathcal{M}$. Then, at all times $t$ in the
range $S(w) \le t < S(\mathrm{end}(T))$, we have $\ell \in W(T')$ for some $T' \in \mathrm{desc}(T) \cap \mathrm{activeX}(T)$.
PROOF. We proceed by induction on time. For the base case, at time $S(w)$, location $\ell$ is added to
$W(\mathrm{xparent}[w])$, and $\mathrm{xparent}[w] \in \mathrm{desc}(T) \cap \mathrm{activeX}(T)$. For the inductive step, let $\ell \in W(T')$
for some $T' \in \mathrm{desc}(T) \cap \mathrm{activeX}(T)$. Once a location is added to a transaction's writeset, it is
never removed until the transaction commits or aborts. If $T' = T$, then we are done. Otherwise,
we have $T' \in \mathrm{pDesc}(T)$ and by definition of $\mathrm{cContent}(T)$, it follows that
$T' \notin \mathrm{open}(C) \cup \mathrm{aborted}(C)$. Therefore, at time $S(\mathrm{end}(T'))$, location $\ell$ is added to
$W(\mathrm{xparent}[T'])$, at which time $\mathrm{xparent}[T']$ is an active descendant of $T$. $\square$
We can now prove that the ON model admits no prefix-races.
Lemma 7.6 For all $v \in \mathrm{memOps}(C)$ and $T \in \mathrm{xactions}(C)$, we have $\neg\mathrm{PRACE}_S(v, T)$.
PROOF. Suppose for contradiction that $\mathrm{PRACE}_S(v, T)$. Then, by Definition 16, we have $v \notin V(T)$
(or equivalently, $T \notin \mathrm{ances}(v)$), and there exists a $w \in \mathrm{cContent}(T)$ such that $\neg(vHw)$ and
$\mathrm{begin}(T) <_S w <_S v <_S \mathrm{end}(T)$, where $v$ and $w$ access the same location $\ell \in \mathcal{M}$ and
one of those accesses is a write.
Consider the case when $W(w, \ell)$. By Lemma 7.5, at time $S(w)$ we have $\ell \in W(T')$, where
$T' \in \mathrm{desc}(T)$. At time $S(v)$, vertex $v$ is added to $R(\mathrm{xparent}[v])$, and $\mathrm{xparent}[v] \notin \mathrm{desc}(T')$,
because otherwise $v \in \mathrm{desc}(T') \subseteq \mathrm{desc}(T)$. Therefore, at time $S(v)$, we have $\ell \in R(\mathrm{xparent}[v])$
and $\ell \in W(T')$, which violates Invariant 4 in Lemma 7.1.
The case when $R(w, \ell)$ is analogous. $\square$
The next series of lemmas show that the observer function created by the ON machine is the
transactional last-writer function according to S.
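Lemma 7.7 For all $T \in \mathrm{xactions}(C)$, $T' \in \mathrm{activeX}(C)$, and $u \in \mathrm{cContent}(T)$, if
$T \notin \mathrm{committed}(C)$ at time $t$ and $u \in W(T')$ at time $t$, then $T' \in \mathrm{desc}(T)$.
PROOF SKETCH. One can prove by induction that at any time $t$ such that $S(u) < t < S(\mathrm{end}(T))$,
we have $h(u) \subseteq \mathrm{xAnces}(u) - \mathrm{pAnces}(T)$ and $h(u) \cap (\mathrm{open}(T) - \{T\}) = \emptyset$. $\square$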
Lemma 7.8 For any $v \in \mathrm{memOps}(C)$, if $\Phi(v) = u$, then $\neg(uHv)$.
PROOF. Assume for contradiction that $uHv$ holds. Then, there exists
$T \in \mathrm{pDesc}(\mathrm{LCA}(u, v)) \cap \mathrm{aborted}(C)$ such that $u \in \mathrm{cContent}(T)$. If the ON machine sets
$\Phi(v) = u$, then $u \in R(T')$ for some $T' \in \mathrm{xAnces}(v)$. By Invariant 5 in Lemma 7.1, it follows that
$u \in W(T'')$, where $T'' \in \mathrm{ances}(T')$, and hence $T \in \mathrm{ances}(T'')$ by Lemma 7.7. Therefore, we have
$T \in \mathrm{ances}(v)$, and $\mathrm{LCA}(u, v) = T \in \mathrm{pDesc}(T)$. Contradiction. $\square$
We say that a vertex $v \in \mathrm{memOps}(C)$ is alive, denoted $\mathrm{alive}(v)$, if $h(v) \cap \mathrm{aborted}(C) = \emptyset$.
Lemma 7.9 Let $w \in V(C)$ be the last vertex in $S$ such that $W(w, \ell)$ and $\mathrm{alive}(w)$. Then, there
exists $T' \in \mathrm{spine}(\ell)$ such that $(\ell, w) \in W(T')$.
PROOF SKETCH. At time $S(w)$, by Invariant 3 of Lemma 7.1, we have $(\ell, w) \in W(\mathrm{xparent}[w])$
and $\mathrm{xparent}[w] \in \mathrm{spine}(\ell)$. Assume for contradiction that $w$ is not on the spine. Since $w$ is alive,
$w$ can only be removed from $\mathrm{spine}(\ell)$ by being overwritten by some $y$ such that $W(y, \ell)$ holds
and $w <_S y$ (from Invariant 6 of Lemma 7.1). Since $w$ is the last writer to $\ell$ which is alive, we
have $\neg\mathrm{alive}(y)$. One can show that $\neg\mathrm{alive}(w)$ in this case. $\square$
Lemma 7.10 For $u, v \in \mathrm{memOps}(C)$ that both access a memory location $\ell \in \mathcal{M}$, if $\Phi(v) = u$,
then for any $w \in \mathrm{memOps}(C)$ such that $u <_S w <_S v$ and $W(w, \ell)$, we have $wHv$.
PROOF. Assume for the purpose of contradiction that there exists a $w \in \mathrm{memOps}(C)$ such that
$u <_S w <_S v$, $W(w, \ell)$, and $\neg(wHv)$. Consider the last such $w$.
If $w \in \mathrm{cContent}(T)$ for some $T \in \mathrm{aborted}(C^{(S(v))})$, then by Lemma 7.4 we have $wHv$.
If $w$ is not in the contents of any aborted transaction at time $S(v)$, then by Lemma 7.9, we
have $w \in W(T)$ for some transaction $T \in \mathrm{spine}(\ell)$, and $T \in \mathrm{ances}(v)$ by Lemma 7.3. Let
$T_R = \mathrm{lowest}(\{T \in \mathrm{xAnces}(v) : \ell \in R(T)\})$, and let $T_W = \mathrm{lowest}(\{T \in \mathrm{xAnces}(v) : \ell \in W(T)\})$. If
$\Phi(v) = u$, then we have $u \in R(T_R)$, since the ON machine always reads from the lowest
ancestor that has $\ell$ in its readset. By Invariant 5, we have $u \in W(T_W)$, but since $u <_S w$, we have
$T_W \in \mathrm{pAnces}(T)$ by Invariant 6 in Lemma 7.1. Therefore, $T$ is a lower ancestor of $v$ than $T_W$,
contradicting the fact that $T_W$ is the lowest ancestor of $v$ with $\ell$ in its writeset. $\square$
We now can prove that the observer function for the ON model is the transactional last-writer
function.
Lemma 7.11 If the ON model generates an execution $(C, \Phi)$, then $\Phi = \mathcal{X}_S$.
PROOF. This proof is technical and is provided in Appendix .3.
$\square$
Theorem 7.12 The ON model implements prefix race-freedom.
PROOF. Combine Lemmas 7.6 and 7.11. $\square$
7.4 Discussion
Open nesting was proposed as a loophole in order to increase the performance and scalability
of transactional programs. The open-nesting methodology incorporates the open-nested commit
mechanism [MCC+06, MH06]. When an open-nested transaction Y (nested inside transaction
X) commits, Y's changes are committed to memory and Y's read and write sets are discarded.
Thus, the TM system no longer detects conflicts with X due to memory accessed by Y. In this
methodology, the programmer considers Y's internal memory operations to be at a "lower level"
than X; thus X should not care about the memory accessed by Y when checking for conflicts.
Instead, Y must acquire an abstract lock based on the high-level operation that Y represents and
propagate this lock to X, so that the TM system can perform concurrency control at an abstract
level. Also, if X aborts, it may need to execute compensating actions to undo the effect of its
committed open-nested subtransaction Y. Moss in [Mos06] illustrates use of open nesting with an
application that uses a B-tree. Ni et al. [NMAT+07] describe a software TM system that supports
the open-nesting methodology.
As we have seen in this section, an unconstrained use of the open-nested commit mechanism
can lead to anomalous program behavior that can be tricky to reason about. We believe that one
reason for the apparent complexity of open nesting is that the mechanism and the methodology
make different assumptions about memory. Consider a transaction Y open nested inside transaction
X. The open-nesting methodology requires that X ignore the "lower-level" memory conflicts
generated by Y, while the open-nested commit mechanism will ignore all the memory operations
inside Y. Say Y accesses two memory locations $\ell_1$ and $\ell_2$, and X does not care about changes
made to $\ell_2$, but does care about $\ell_1$. The TM system can not distinguish between these two accesses,
and will commit both in an open-nested manner, leading to anomalous behavior.
Researchers have demonstrated specific examples [CMC+07, NMAT+07] that safely use an
open-nested commit mechanism. These examples work, however, because the inner (open) trans-
actions never write to any data that is accessed by the outer transactions. Moreover, since these
examples require only two levels of nesting, it is not obvious how one can correctly use open-
nested commits in a program with more than two levels of abstraction. The literature on TM offers
relatively little in the way of formal programming guidelines which one can follow to have provable
guarantees of safety when using open-nested commits.
In the next chapter, we shall see a new TM design that constrains the use of open nesting in
order to provide guarantees that are easier to reason about.
Chapter 8
Safe Open-Nested Transactions Through
Ownership
In Chapter 7, we saw that open nesting provides an unintuitive and noncomposable memory
model that is hard to reason about. In this chapter, we bridge the gap between memory-level
mechanisms for open nesting and the high-level view by explicitly integrating the notions of "trans-
actional modules" (Xmodules) and "ownership" into the TM system. The ownership-aware TM
system allows the programmer to safely use the methodology of open nesting, because the runtime's
behavior more closely reflects the programmer's intent. In addition, the structure imposed by own-
ership allows a language and runtime to enforce properties needed to provide provable guarantees
of "safety" to the programmer. Therefore, a concurrency platform that provides ownership-aware
transactions provides concurrency as well as safety to a programmer who is using transactions.
The chapter is organized as follows. Section 8.1 explains the precise contributions of the work
described in this chapter. Section 8.2 presents an overview of ownership-aware TM and highlights
key features using an example application. Section 8.3 describes language constructs for specify-
ing Xmodules and ownership. In Section 8.4, we extend the transactional computation framework
from Chapter 6 to formally incorporate Xmodules and ownership. Section 8.5 describes the precise
operational model for ownership-aware transactions, and Section 8.6 gives a formal definition of
serializability by modules, and a proof sketch that the operational model guarantees this definition.
Section 8.7 provides conditions under which the ownership-aware TM does not exhibit semantic
deadlocks. Section 8.8 concludes with a discussion of some related work.
8.1 Contributions
The contributions of this work on ownership-aware transactions are as follows:
1. We suggest a concrete set of guidelines for sharing of data and interactions between trans-
actional modules, called Xmodules. Xmodules can be thought of as software modules that
own and manage their own data and provide external functions to provide services to other
Xmodules.
This work was done in collaboration with Angelina Lee and Jim Sukha [ALS09].
2. We describe how the Xmodules and ownership can be specified in a Java-like language and
propose a type system that enforces most of the above-mentioned guidelines in the programs
written using this language extension.
3. We formally describe the operational model for ownership-aware TM, called the OAT model,
which uses a new ownership-aware commit mechanism. The ownership-aware commit
mechanism is a compromise between an open-nested and a closed-nested commit; when a
transaction T commits, a change to memory location $\ell$ is committed globally if $\ell$ belongs to
the module of T; otherwise, the read or write to $\ell$ is propagated to T's parent transaction. Unlike
an ordinary open-nested commit, the ownership-aware commit treats memory locations
differently depending on which Xmodule owns the location. Note that the ownership-aware
commit is still a mechanism; programmers must still use it in combination with abstract locks
and compensating actions to implement the full methodology.
4. We prove that if a program follows the proposed guidelines for Xmodules, then the OAT
model guarantees serializability by modules, which is a generalization of "serializability by
levels" used in database transactions. Ownership-aware commit is the same as open-nested
commit if no module ever accesses data belonging to other modules. Thus, one corollary
of our theorem is that open-nested transactions are serializable when modules do not share
data. This observation explains why researchers [CMC+07, NMAT+07] have found it natural
to use open-nested transactions in the absence of sharing, in spite of the apparent semantic
pitfalls.
5. We prove that under certain restricted conditions, a computation executing under the OAT
model can not enter a semantic deadlock.
In later sections, we distinguish between the variations of nested transactions as follows. We
say that a transaction Y is vanilla open nested when referring to a TM system which performs the
open-nested commit of Y. We say that Y is safe nested when referring to the ownership-aware
TM system which performs the ownership-aware commit of Y. Finally, we say that a transaction
Y is an open-nested transaction when we are referring to the abstract methodology, rather than a
particular implementation with a specific commit mechanism.
8.2 Ownership-Aware Transactions
In this section, we give an overview of ownership-aware TM. To motivate the need for the concept
of ownership in TM, we first present an example application which might benefit from open nesting.
We then introduce the notion of an Xmodule and informally explain the programming guidelines
when using Xmodules. Finally, we highlight some of the key differences between ownership-aware
TM and a TM with vanilla open nesting. In this section, we present the intuitive descriptions of the
concepts in ownership-aware TM; we defer formal definitions until later sections.
Example application
We describe an example application for which one might use open-nested transactions. This exam-
ple is similar to the one in [Mos06], but it includes data sharing between nested transactions and
their parents, and has more than two levels of nesting.
Since the open-nesting methodology is designed for programs that have multiple levels of abstraction,
we choose a modular application. Consider a user application which concurrently accesses a
database of many individuals' book collections. The database stores records in a binary search tree,
keyed by name. Each node in the binary search tree corresponds to a person, and stores a list of
books in his/her collection. The database supports queries by name, as well as updates that add a
new person or a new book to a person's collection. The database also maintains a private hashmap,
keyed by book title, to support a reverse query; given a book title, it returns a list of people who
own the book. Finally, the user application wants the database to log changes on disk for recover-
ability. Whenever the database is updated, it inserts metadata into the buffer of a logger to record
the change that just took place. Periodically, the user application is able to request a checkpoint
operation which flushes the buffer to disk.
This application is modular, with five natural modules: the user application (UserApp),
the database (DB), the binary search tree (BST), the hashmap (Hashmap), and the logger
(Logger). The UserApp module calls methods from the DB module when it wants to insert
into the database, or query the database. The database in turn maintains internal metadata and
calls the BST module and the Hashmap module to answer queries and insert data. Both the user
application and the database may call methods from the Logger module.
If the modules use open-nested transactions, a TM system with vanilla open-nested commits
can result in non-intuitive outcomes. Consider the example where a transactional method A from
the UserApp module tries to insert a book b into the database, and the insert is an open-nested
transaction. The method A (which corresponds to transaction X) calls an insert method in the
DB module and passes b (the Book object) to be inserted. This insert method generates an
open-nested transaction Y. Suppose Y writes to some field of the book b (memory location $\ell_1$), and
also writes some internal database metadata (location $\ell_2$). After a vanilla open-nested commit of
Y, the modifications to both $\ell_1$ and $\ell_2$ become visible globally. Assuming the UserApp does
not care about the internal state of the database, committing the internal state of the DB ($\ell_2$) is a
desirable effect of open nesting; this commit increases concurrency, because other transactions can
potentially modify the database in parallel with X without generating a conflict. The UserApp
does, however, care about changes to the book b; thus, the commit of $\ell_1$ breaks the atomicity of
transaction X. A transaction Z in parallel with transaction X can access this location $\ell_1$ after Y
commits, before the outer transaction X commits.¹ To increase concurrency, we want the method
from DB to commit changes to its own internal data; we do not, however, want it to commit the data
that UserApp cares about.
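The anomaly can be replayed with a toy model of a vanilla open-nested commit. This is a minimal sketch, not a TM implementation: the dictionary standing in for global memory, the location names "book.field" and "db.meta" (playing the roles of ℓ1 and ℓ2), and the function open_nested_commit are all assumptions made for the example.

```python
def open_nested_commit(writes, memory):
    """Vanilla open-nested commit: publish every write, then discard the set."""
    memory.update(writes)
    writes.clear()


# Global memory before transaction X (from UserApp) starts.
memory = {"book.field": "old", "db.meta": 0}

# X calls DB.insert, which runs as the open-nested transaction Y.
x_writes = {}                                    # X's own write set
y_writes = {"book.field": "new", "db.meta": 1}   # Y touches both locations

open_nested_commit(y_writes, memory)

# X has not committed yet, but a parallel transaction Z already sees the
# change to the book, and nothing in X's write set records the access.
z_sees = memory["book.field"]
```

Publishing "db.meta" early is the intended gain in concurrency; publishing "book.field" early is the atomicity violation described above, and the mechanism has no way to tell the two apart.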
To enforce this kind of restriction, we need some notion of ownership of data: if the TM
system is aware of the fact that the book object "belongs" to the UserApp, then it can decide
not to commit DB's change to the book object globally. For this purpose, we introduce the notion
of transactional modules, or Xmodules. When a programmer explicitly defines Xmodules and
specifies the ownership of data, the TM system can make the correct judgement about which data
to commit globally.
¹ Note that abstract locks [Mos06] do not address this problem. Abstract locks are meant to disallow other
transactions from noticing the fact that the book was inserted into the DB. They do not usually protect the individual fields of
the book object itself.
Xmodules and the ownership-aware commit mechanism
The ownership-aware TM system requires that programs be organized into Xmodules. Intuitively,
an Xmodule A is a stand-alone entity that contains data and transactional methods; an Xmodule
owns data that it privately manages, and uses its methods to provide public services to other
modules. During program execution, a call to a method from Xmodule A generates a transaction
instance (e.g., X). If this method in turn calls another method from an Xmodule B, an additional
transaction Y, safe nested inside X, is created only if $A \ne B$. Therefore, defining an Xmodule
automatically specifies safe-nested transactions.
In the ownership-aware TM system, every memory location is owned by exactly one Xmodule.
If a memory location $\ell$ is in a transaction T's read or write set, the ownership-aware commit of
T commits this access globally only if T is generated by the same Xmodule that owns
$\ell$; in this case, we say that T is "responsible" for that access to $\ell$. Otherwise, the read or write to $\ell$
is propagated up to the read or write set of T's parent transaction; that is, the TM system behaves
as though T were a closed-nested transaction with respect to location $\ell$.
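The commit rule just described can be sketched as follows. This is an illustrative model only, under several assumptions: the Txn class, the owner map from locations to Xmodule names, the dictionary used as global memory, and the function name oat_commit are hypothetical, and conflict detection, abstract locks, and compensating actions are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Txn:
    module: str                                  # Xmodule that generated it
    parent: Optional["Txn"] = None
    reads: dict = field(default_factory=dict)    # location -> value read
    writes: dict = field(default_factory=dict)   # location -> value written


def oat_commit(txn, owner, memory):
    """Ownership-aware commit of txn (sketch).

    Accesses to locations owned by txn's Xmodule are committed globally;
    all other accesses are propagated to the parent's read or write set.
    """
    for loc, value in txn.writes.items():
        if owner[loc] == txn.module:
            memory[loc] = value                  # txn is "responsible"
        elif txn.parent is None:
            raise RuntimeError(f"no transaction is responsible for {loc}")
        else:
            txn.parent.writes[loc] = value       # closed-nested behavior
    for loc, value in txn.reads.items():
        if owner[loc] != txn.module:
            if txn.parent is None:
                raise RuntimeError(f"no transaction is responsible for {loc}")
            txn.parent.reads.setdefault(loc, value)
    txn.reads.clear()
    txn.writes.clear()


# The book example: UserApp owns the book, DB owns its own metadata.
owner = {"book.field": "UserApp", "db.meta": "DB"}
memory = {"book.field": "old", "db.meta": 0}

X = Txn(module="UserApp")
Y = Txn(module="DB", parent=X, writes={"book.field": "new", "db.meta": 1})
oat_commit(Y, owner, memory)
```

After Y commits, only the database metadata is globally visible; the write to the book sits in X's write set, so a conflicting access by another transaction can still be detected until X itself commits.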
For ownership-aware TM to behave "nicely" we must restrict interactions between Xmodules.
For example, in the TM system, some transaction must be "responsible" for committing every
memory access. Similarly, the TM system should guarantee some form of serializability. If Xmod-
ules could arbitrarily call methods from or access memory owned by other Xmodules, then these
two properties might not be satisfied.
Rules for Xmodules
Ownership-aware TM uses Xmodules to control both the structure of nested transactions, and the
sharing of data between Xmodules (i.e., to limit which memory locations a transaction instance can
access). In our system, Xmodules are arranged as a module tree, denoted as D. In D, an Xmodule
B is a child of A if B is "encapsulated by" A. The root of D is a special Xmodule called world.
Each Xmodule is assigned an xid by visiting the nodes of D in a left-to-right depth-first search
order, and assigning ids in increasing order, starting with xid(world) = 0. Therefore world has
the minimum xid, and "lower-level" Xmodules have larger xid numbers.
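The xid numbering is an ordinary preorder walk of the module tree. The sketch below assumes the tree is given as a map from each Xmodule to its ordered list of children; the module names A, B, C, and D and the function assign_xids are placeholders for illustration.

```python
def assign_xids(children, root="world"):
    """Number Xmodules in left-to-right depth-first (preorder) order."""
    xid = {}

    def visit(module):
        xid[module] = len(xid)            # next unused id, starting at 0
        for child in children.get(module, []):
            visit(child)

    visit(root)
    return xid


# Hypothetical module tree: world -> A; A -> B, D; B -> C.
children = {"world": ["A"], "A": ["B", "D"], "B": ["C"]}
xid = assign_xids(children)
```

Every Xmodule receives a larger xid than all of its ancestors, and a subtree is numbered completely before any sibling to its right.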
Definition 19 We impose two rules on Xmodules based on the module tree:
1. Rule 1: A method of an Xmodule A can access a memory location $\ell$ directly only if $\ell$ is
owned either by A or by an ancestor of A in the module tree. This rule means that an ancestor
Xmodule B of A may pass data down to a method belonging to A, but a transaction from
module A can not directly access any "lower-level" memory.
2. Rule 2: A method from A can call a method from B only if B is the child of some ancestor
of A, and xid(B) > xid(A) (i.e., if B is "to the right" of A in the module tree). This rule
requires that an Xmodule can call methods of some (but not all) lower-level Xmodules.²
The intuition behind these rules is as follows. Xmodules have methods to provide services
to other higher-level Xmodules, and Xmodules maintain their own data in order to provide these
services. Therefore, a higher-level Xmodule can pass its data to a lower-level Xmodule and ask
for services. A higher-level Xmodule should not directly access the internal data belonging to a
lower-level Xmodule.
² An Xmodule can, in fact, call methods within its own Xmodule or from its ancestor Xmodules, but we model these
calls differently. We explain these cases at the end of this section.
If Xmodules satisfy Rules 1 and 2, TM can have a well-defined ownership-aware commit mech-
anism; some transaction is always "responsible" for every memory access (proved in Section 8.5).
In addition, these rules and the ownership-aware commit mechanism guarantee that transactions
satisfy the property of "serializability by modules" (proved in Section 8.6).
One potential limitation of ownership-aware TM is that some "cyclic dependencies" between
Xmodules are prohibited. The ability to define one module as being at a lower level than another
is fundamental to the open-nesting methodology. Thus, our formalism requires that Xmodules be
partially ordered; if an Xmodule A can call Xmodule B, then conceptually A is at a higher level
than B (i.e., xid(A) < xid(B)), and thus B can not call A. If two components of the program call
each other, then, conceptually, neither of these components is at a higher-level than the other, and
we would require that these two components be combined into the same Xmodule.
Xmodules in the example application
Consider a Java implementation of the example application described earlier. It may have the fol-
lowing classes: UserApp as the top-level application that manages the book collections, Person
and Book as the abstractions representing book owners and books, DB for the database, BST and
Hashmap for the binary search tree and hashmap maintained by the database, and Logger for log-
ging the metadata to disk. In addition, there are some other auxiliary classes: tree node BSTNode
for the BST, Bucket in the Hashmap, and Buffer used by the Logger.
For ownership-aware TM, not all of a program's classes are meant to be Xmodules; some
classes only wrap data. In our example, we identified five Xmodules- UserApp, DB, BST,
Hashmap, and Logger; these classes are stand-alone entities which have encapsulated data and
methods. Classes such as Book and Person, on the other hand, are data types used by UserApp.
Similarly, classes like BSTNode and Bucket are data types used by BST and Hashmap to main-
tain their internal state.
We organize the Xmodules of the application into the module tree shown in Figure 8.1. UserApp
is encapsulated by world; DB and Logger are encapsulated under UserApp; BST and Hashmap
are encapsulated under DB. By dividing Xmodules this way, the ownership of data falls out natu-
rally, i.e., an Xmodule owns certain pieces of data if the data is encapsulated under the Xmodule.
For example, the instances of Person or Book are owned by UserApp because they should only
be accessed by either UserApp or its descendants.
Let us consider the implications of Definition 19 for the example. Due to Rule 1, all of DB,
BST, Hashmap, and Logger can directly access data owned by UserApp, but the UserApp
can not directly access data owned by any of the other Xmodules. This rule corresponds to standard
software-engineering rules for abstraction; the "high-level" Xmodule UserApp should be able to
pass its data down, allowing lower-level Xmodules to access that data directly, but UserApp itself
should not be able to directly access data owned by lower-level Xmodules. Due to Rule 2, the
UserApp may invoke methods from DB, DB may invoke methods from BST and Hashmap, and
every other Xmodule may invoke methods from Logger. Thus, Rule 2 allows all the operations
required by the example application. As expected, the UserApp can call the insert and search
methods from the DB and can even pass its data to the DB for insertion. More importantly, notice
the relationship between BST and Logger. The BST Xmodule can call methods from Logger,
but the BST can not pass data it owns directly into the Logger. It can, however, pass data owned
by the UserApp to the logger, which is all this application requires.
world (xid: 0)
  UserApp (xid: 1)
    DB (xid: 2)
      BST (xid: 3)
      Hashmap (xid: 4)
    Logger (xid: 5)
Figure 8.1: A module tree D for the program described in Section 8.2. The xid's are assigned according to
a left-to-right depth-first tree walk, numbering Xmodules in increasing order, starting with xid(world) = 0.
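The consequences of Definition 19 for the example can be checked mechanically. The sketch below hard-codes the module tree and xids of Figure 8.1 (with BST numbered before Hashmap, which is an assumption about the figure's left-to-right order) and reads "ancestor" in Rule 2 as including the Xmodule itself, since the text allows UserApp to call its own child DB. The names PARENT, XID, ances, may_access, and may_call are hypothetical.

```python
# Module tree of Figure 8.1 as a child -> parent map, plus the xids.
PARENT = {"UserApp": "world", "DB": "UserApp", "Logger": "UserApp",
          "BST": "DB", "Hashmap": "DB"}
XID = {"world": 0, "UserApp": 1, "DB": 2, "BST": 3, "Hashmap": 4, "Logger": 5}


def ances(module):
    """Ancestors of module in the module tree, including module itself."""
    chain = [module]
    while module in PARENT:
        module = PARENT[module]
        chain.append(module)
    return chain


def may_access(a, owner):
    """Rule 1: a method of A may directly access data owned by `owner`."""
    return owner in ances(a)


def may_call(a, b):
    """Rule 2: B is the child of some ancestor of A, and xid(B) > xid(A)."""
    return b in PARENT and PARENT[b] in ances(a) and XID[b] > XID[a]


lower_modules = ["DB", "BST", "Hashmap", "Logger"]
rule1_down = all(may_access(m, "UserApp") for m in lower_modules)
rule1_up = any(may_access("UserApp", m) for m in lower_modules)
logger_callers = [m for m in XID if may_call(m, "Logger")]
```

Here rule1_down holds and rule1_up does not, matching the discussion of Rule 1; logger_callers lists UserApp, DB, BST, and Hashmap. The checks also show that may_access("Logger", "BST") is false, which is why BST can not hand its own data to the Logger.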
Advantage of ownership-aware transactions
One of the major problems with vanilla open nesting is that some transactions can see inconsistent
data. Say a transaction Y is open nested inside transaction X. Let $v_0$ be the initial value of location
$\ell$, and suppose Y writes value $v_1$ to location $\ell$ and then commits. Now a transaction Z in parallel
with X can read this location $\ell$, write value $v_2$ to $\ell$, and commit, all before X commits. Therefore,
X can now read this location $\ell$ and see the value $v_2$, which is neither the initial value $v_0$ (the value
of $\ell$ when X started), nor $v_1$, which was written by X's inner transaction, Y. This behavior might
seem counterintuitive.
Now consider the same example for ownership-aware transactions. Say X is generated by a
method of Xmodule A and Y is generated by a method of Xmodule B. If B owns $\ell$, X can not
access $\ell$, since xid(A) < xid(B) (by Definition 19, Rule 2), and no transaction from a higher-level
module can access data owned by a lower-level module (by Definition 19, Rule 1). Thus, the
problem does not arise. If B does not own $\ell$, the ownership-aware commit of Y will not commit
the changes to $\ell$ globally and $\ell$ will be propagated to X's write set. Therefore, if Z tries to access $\ell$
before X commits, the TM system will detect a conflict. Thus X can not see an inconsistent value
for $\ell$.³
³ For simplicity, we have described the case where Y is directly nested inside X. The case where Y is more deeply
open nested inside X behaves in a similar fashion.
Callbacks
At first glance, the assumptions we have made regarding methods of Xmodules seem somewhat
restrictive. In the description thus far, we prohibit an Xmodule A from calling another transac-
tional method from A or a proper ancestor of A. In particular, it appears as though our model
disallows callbacks. Our model, however, does permit both these cases; we simply model these
calls differently.
If a method X from Xmodule A calls another method Y from an ancestor Xmodule B, this
new call does not generate a new safe-nested transaction instance. Instead, Y is subsumed in X
using flat (or closed) nesting. Recall that Rule 1 in Definition 19 allows X to access data belonging
to B or any of its ancestors directly. Therefore, we can treat any data access by a flat (or closed)
nested transaction Y as being accessed by X directly, provided that Y and its nested transactions
access only memory belonging to B or B's ancestors. We say that Y is a proper callback method
for Xmodule B if its nested calls are all proper callback methods belonging to Xmodules which are
ancestors of B. In our formal model in Section 8.4, we assume that we only have proper callbacks
and model them as direct memory accesses, allowing us to ignore them in the formal definitions.
Closed-nested transactions
In our model, every method call that crosses an Xmodule boundary automatically generates a safe-
nested transaction. Ownership-aware TM can effectively provide closed-nested transactions, how-
ever, with appropriate specifications of ownership. If an Xmodule A owns no memory, but only
operates on memory belonging to its proper ancestors, then transactions of A will effectively be
closed nested. In the limit, if the programmer specifies that all memory is owned by the world
Xmodule, then all changes in any transaction's read or write set are propagated upwards; thus all
ownership-aware commits behave exactly as closed-nested commits.
8.3 Ownership Types for Xmodules
When using ownership-aware transactions, the Xmodules and data ownership in a program must
be specified for two reasons. First, the ownership-aware commit mechanism depends on these
concepts. Second, we can guarantee some notion of serializability only if a program has Xmodules
which conform to the rules in Definition 19. In this section, we describe language constructs and
a type system that can be used to specify Xmodules and ownership in a Java-like language. Our
type system - the OAT type system - statically enforces some of the restrictions described in
Definition 19.
The OAT type system extends the ownership types of Boyapati et al. [BLS03], which is de-
scribed first in this section. We then describe extensions to this type system to enforce some of the
restrictions in Definition 19. Next, we present code for parts of the example application described
in Section 8.2. Finally, we discuss some restrictions required by Definition 19 which the OAT type
system does not enforce statically. The type system's annotations, however, enable dynamic checks
for these restrictions.
Boyapati et al.'s parametric ownership type system
The type system of Boyapati et al. provides a mechanism for specifying ownership of objects. The
type system enforces the properties stated in Lemma 8.1.
Lemma 8.1 The type system in [BLS03] enforces the following properties:
1. Every object has a unique owner.
2. The owner can be either another object, or world.
3. The ownership relation forms an ownership tree (of objects) rooted at world.
4. The owner of an object does not change over time.
5. An object a can access another object b directly only if b's owner is either a, or one of a's
proper ancestors in the ownership tree.
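Property 5 can be illustrated with a minimal Python sketch. It is not taken from [BLS03]; the dictionary owner and the object names in it are hypothetical and only model an ownership tree rooted at world.

```python
# Hypothetical ownership tree: each object maps to its unique owner (property 1),
# and every chain of owners ends at "world" (property 3).
owner = {
    "app": "world",
    "db": "app",
    "logger": "app",
    "bst": "db",
    "node": "bst",
}

def proper_ancestors(obj):
    """Return the proper ancestors of obj in the ownership tree, nearest first."""
    result = []
    while obj != "world":
        obj = owner[obj]
        result.append(obj)
    return result

def can_access(a, b):
    """Condition of property 5: b's owner is a itself or one of a's proper ancestors."""
    return owner[b] == a or owner[b] in proper_ancestors(a)
```

Under these assumptions, db may access bst (which it owns) and bst may access logger (owned by its ancestor app), but app may not reach node directly, because node is owned by bst.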
Boyapati et al.'s type system requires ownership annotations on class definitions and type
declarations to guarantee Lemma 8.1. Every class type T1 has a set of associated ownership tags,
denoted T1<f1, f2, ..., fn>. The first formal f1 denotes the owner of the current instance of the
object (i.e., the this object). The remaining formals f2, f3, ..., fn are additional tags which can be used
to instantiate and declare other objects within the class definition. The formals are assigned
actual owners o1, o2, ..., on when an object a of type T1 is instantiated. By parameterizing class
and method declarations with ownership tags, the type system of [BLS03] permits owner
polymorphism. Thus, one can define a class type (e.g., a generic hash table) once, but instantiate multiple
instances of that class with different owners in different parts of the program.
The type system enforces the properties in Lemma 8.1 by performing the following checks:
1. Within the class definition of type T1, only the tags {f1, f2, ..., fn} ∪ {this, world} are
visible. The this ownership tag represents the object itself.
2. Within a class definition, a variable c2 with type T2<f2, ...> can be assigned to a variable c1
with type T1<f1, ...> if and only if T2 is a subtype of T1 and f1 = f2.
3. If an object a's tags are instantiated to be o1, o2, ..., on when a is created, then in the ownership
tree, o1 must be a descendant of oi for all i ∈ {2, ..., n} (denoted by o1 -< oi henceforth).
It is shown in [BLS03] that these type checks guarantee the properties of Lemma 8.1.
In some cases, to enable the type system to perform check 3 locally, the programmer may need
to specify a where clause in a class declaration. For example, suppose the class declaration of type
T1 has formal tags <f1, f2, f3>, and inside T1's definition, some type T2 object is instantiated with
ownership tags <f2, f3>. The type system can not determine whether or not f2 -< f3. To resolve this
ambiguity, the programmer must specify where (f2 <= f3) at the class declaration of type T1.
When an instance of type T2 object is instantiated, the type system then checks that the where
clause is satisfied.
The OAT type system
The ownership tree described in [BLS03] exhibits some of the same properties as the module tree
we described in Section 8.2; however, the type system and ownership scheme of [BLS03] do not
enforce two major requirements of our system.
* In [BLS03], any object can own other objects. Our rules, however, require that only Xmod-
ules own other objects.
* In [BLS03], an object can call any of its ancestor's siblings. Our rules (namely Definition 19),
however, dictate that an Xmodule A can only call its ancestor's siblings to the right.
With these requirements in mind, we extend Boyapati et al.'s type system to create the OAT type
system.
The extensions to handle the first requirement are straightforward. The OAT type system ex-
plicitly distinguishes objects and Xmodules by requiring that Xmodules extend from a special
Xmodule class. The OAT type system only allows classes that extend Xmodule to use this as
an ownership tag. In the context of Boyapati et al.'s ownership tree, this restriction creates a tree
where all the internal nodes are Xmodules and all leaves are non-Xmodule objects. If we ignore
any order imposed on the children of an Xmodule, for ownership-aware TM, the module tree (as
described in Section 8.2) is essentially the ownership tree with all non-Xmodule objects removed.
The second requirement is more complicated to enforce. First, we extend each owner instance
o to have two fields: name, represented by o.name; and index, represented by o.index. The name
field is conceptually the same as an ownership instance in the type system of [BLS03]. The index
field is added to help the compiler infer the ordering between children of the same Xmodule in the
module tree. The OAT type system allows the programmer to pass this[i] as the ownership
tag (i.e., with an index i) instead of this. Similarly, one can use world[i] as an ownership
tag. Indices enable the type system to infer an ordering between two sibling Xmodules A and
B; for instance, if an Xmodule C instantiates A and B with owners this[i] and this[i+1],
respectively, then A appears to the left of B in the module tree.
Finally, for technical reasons, the OAT system prohibits all Xmodules A from declaring
primitive fields. If A had primitive fields, then by Boyapati et al.'s type system, these fields would be owned
by A's parent. Since this property seems counterintuitive, we opted to disallow primitive fields
for Xmodules.
In summary, the OAT type system performs these checks:
1. Within the class definition of type T1, only the tags {f1, f2, ..., fn} ∪ {this, world} are
visible.
2. In a class declaration, a variable c2 with type T2<f2, ...> can be assigned to a variable c1 with
type T1<f1, ...> if and only if T2 and T1 have the same type and all the formals match in
name. In addition, if the indices are specified for the tags, then they must match.
3. For a type T<o1, o2, ..., on>, we must have, for all i ∈ {2, ..., n}, either o1.name -< oi.name
or o1.name = oi.name and o1.index < oi.index (if both indices are known).4
4. The ownership tag this can only be used within the definition of a class that extends
Xmodule.
5. Xmodule objects can not have primitive-type fields.
4 In the ownership tree, for any Xmodule A, the OAT type system implicitly assigns non-Xmodule children of A
higher indices than the Xmodule children of A, unless the user specifies otherwise.
The first three checks are analogous to the checks in Boyapati et al.'s type system. The last two
checks are added to enforce the additional requirements of Xmodules.
The OAT type system supports where clauses of the form where (fi < fj); when fi and
fj are instantiated with oi and oj, the type system ensures that either oi.name -< oj.name, or
oi.name = oj.name and oi.index < oj.index. The detailed type rules for the OAT type system
are described in [ALS08].
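The ordering test behind check 3 and the where clauses can be sketched in Python. This is an illustration only, not the type rules of [ALS08]: the Tag class, the parent dictionary, and the function names are hypothetical, the sketch assumes every tag carries a statically known index, and it reads -< as "proper descendant in the module tree".

```python
from typing import NamedTuple

# Hypothetical module tree: Xmodule name -> parent Xmodule name.
parent = {"UserApp": "world", "DB": "UserApp", "Logger": "UserApp"}

class Tag(NamedTuple):
    name: str    # the owning Xmodule (o.name)
    index: int   # position among the owner's children (o.index), assumed known

def is_proper_descendant(a, b):
    """True if Xmodule a lies strictly below Xmodule b in the module tree."""
    while a != "world":
        a = parent[a]
        if a == b:
            return True
    return False

def tag_below(o1, oi):
    """One pair of check 3: o1.name -< oi.name, or equal names and o1.index < oi.index."""
    if is_proper_descendant(o1.name, oi.name):
        return True
    return o1.name == oi.name and o1.index < oi.index

def check_tags(tags):
    """Check 3 for a type T<o1, ..., on>: o1 must be below every other tag."""
    return all(tag_below(tags[0], oi) for oi in tags[1:])
```

With these assumptions, the tag list for DB<this[0], this[1], this[2]> inside UserApp corresponds to check_tags([Tag("UserApp", 0), Tag("UserApp", 1), Tag("UserApp", 2)]), which passes, while swapping the first two indices fails.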
Example application using the OAT type system
Figure 8.2 illustrates how one can specify Xmodules and ownership using ownership types. The
programmer specifies an Xmodule by creating a class which extends from a special Xmodule
class. The DB class has three formal owner tags - dbO which is the owner of the DB Xmodule
instance, logO which is the owner of the Logger Xmodule instance that the DB Xmodule will
use, and dataO which is the owner of the user data being stored in the database. When an instance
of UserApp initializes Xmodules in lines 5-6, it declares itself as the owner of the Logger, the
DB, and the user data being passed into DB. The indices on this declare the ordering of
Xmodules in the module tree, i.e., the user data is lower level than the Logger, and the Logger
is lower level than the DB. Lines 11-13 illustrate how the DB class can initialize its Xmodules and
propagate the formal owner tags (i.e., logO and dataO) down.
Note that in order for this code to type check, the DB class must declare logO < dataO using
the where clause in line 10, otherwise the type check would fail at line 11, due to ambiguity of
their relation in the module tree. The where clause in line 10 is checked whenever an instance of
DB is created, i.e. at line 6.
The OAT type system's guarantees
The following lemma about the OAT type system can be proved in a reasonably straightforward
manner using Lemma 8.1.
Lemma 8.2 The OAT type system guarantees the following properties.
1. An Xmodule A can access a (non-Xmodule) object b with ownership tag ob only if A -< ob.name.
2. An Xmodule A can call a method in another Xmodule B with owner oB only if one of the
following is true:
(a) A = oB.name (i.e., A owns B);
(b) The least common ancestor of A and B in the module tree is oB.name;
(c) B >- A (i.e., B is an ancestor of A).
1  public class UserApp<appO> extends Xmodule {
2    private Logger<this[1], this[2]> logger;
3    private DB<this[0], this[1], this[2]> db;
4    public UserApp() {
5      logger = new Logger<this[1], this[2]>();
6      db = new DB<this[0], this[1], this[2]>(logger);
7    }
8  }
9  public class DB<dbO, logO, dataO>
10     extends Xmodule where (logO < dataO) {
11   private Logger<logO, dataO> logger;
12   private BST<this[0], logO, dataO> bst;
13   private Hashmap<this[1], logO, dataO> hashmap;
14   public DB(Logger<logO, dataO> logger) {
15     this.logger = logger;
16   }
17 }
Figure 8.2: Specifying Xmodules and ownership for the example application described in Section 8.2.
Lemma 8.2 does not, however, guarantee all the properties we want from Xmodules (i.e., Defi-
nition 19). In particular, Lemma 8.2 does not consider any ordering of sibling Xmodules. The OAT
type system can, however, provide stronger guarantees for a program which satisfies what we call
the unique owner indices assumption: for all Xmodules A, all children of A in the module tree are
instantiated with ownership tags with unique indices that can be statically determined. For such a
program, the type system can order the children of every Xmodule A from smallest to largest index,
and assign the xid to each Xmodule as described in Section 8.2. Then, the following result holds:
Theorem 8.3 For a program with unique owner indices, in addition to Lemma 8.2, the OAT type
system guarantees that if the least common ancestor of Xmodules A and B in the module tree is
oB.name, then A can call a method in B only if xid(A) < xid(B).
PROOF.
We prove (by contradiction) that if the least common ancestor of A and B in the module tree is
oB.name, and xid(A) > xid(B), then A can not have a formal tag with value oB. Therefore, it
can not declare a type with owner tag oB, and can not access B.
Let C be the least common ancestor of A and B. Since C = oB.name, we know that C is B's parent.
Let D be the ancestor of A which is B's sibling, and let oD be D's ownership tag (i.e., the tag with
which D is instantiated). Since B and D have the same parent (i.e., C) in the module tree, we have
oB.name = oD.name = C. Since xid(A) > xid(B), A is to the right of B in the ownership tree.
Therefore, D, which is an ancestor of A, is to the right of B in the ownership tree. Therefore, we
have oD.index > oB.index.
Assume for contradiction that A does have oB as one of its tags. Using Lemma 8.1, one can
show that the only way for A to receive tag oB is if D also has a formal tag with value oB. Thus,
D's first formal owner tag has value oD and another one of its formals has value oB.
Let E_0 = D, and consider the chain of Xmodule instantiations where Xmodule E_i instantiated
E_{i-1}. E_1 has to instantiate D (which is the same as E_0) using its formal ownership tags <f_a^1, f_b^1, ...>,
where f_a^1 has value oD and f_b^1 has value oB. (We must have f_a^1 as the first formal, since oD is the
owner of D. Without loss of generality, we can have f_b^1 be the second formal since the type system
does not care about the ordering of formal tags after the first one.)
Since oB.name = oD.name = C, this chain of instantiations must lead back to C, since that
is the only Xmodule that can create ownership tags with values oB and oD in its class definition
(using the keyword this).5 Let E_k = C. For the class declaration of each of the Xmodules E_i
for 1 ≤ i < k, the following must be true.
* E_i must have formals f_a^i and f_b^i, with values oD and oB, respectively, and E_i must pass these
formals into the instantiation of E_{i-1}.
* In the type definition of E_i's class, E_i must have the constraint f_a^i < f_b^i on its formal tags
(either because f_a^i is the owner tag, or through a where clause that enforces f_a^i < f_b^i).
The first condition must hold for us to be able to pass both oB and oD down to E_0 = D. The
second condition is true for the Xmodules by induction. In the base case, E_1 must know that
f_a^1 < f_b^1; otherwise, the type system will throw an error when it tries to instantiate E_0 = D with
owner f_a^1. Then, inductively, E_i must know f_a^i < f_b^i to be able to instantiate E_{i-1}.
Finally, E_{k-1} is instantiated in the class file corresponding to E_k = C. In this declaration, the
formal f_a^{k-1} with value oD is instantiated with this[x]. Similarly, f_b^{k-1} with value oB is instantiated
with this[y]. Since the class definition of E_{k-1} type checks, we must have f_a^{k-1} < f_b^{k-1}. This check
contradicts our original assumption that x > y, however, since if x > y our type check should fail.
Therefore, we must have oD.index < oB.index.
Theorem 8.3 modifies only Condition 2b of Lemma 8.2. Therefore, Lemma 8.2 along
with Theorem 8.3 imposes restrictions on every Xmodule A which are only slightly weaker than
the restrictions required by Definition 19. Condition 1 in Lemma 8.2 corresponds to Rule 1 of
Definition 19. Conditions 2a and 2b are the cases permitted by Rule 2. Condition 2c, however,
corresponds to the special case of callbacks or calling a method from the same Xmodule, which is
not permitted by Definition 19. This case is modeled differently, as we explained in Section 8.2.
The OAT type system is a best-effort type system to check for the restrictions required by
Definition 19. The OAT type system can not fully guarantee, however, that a type-checked program
does not violate Definition 19. Specifically, the OAT type system can not always detect the
following violations statically. First, if the program does not have unique owner indices, then C may
instantiate both A and B with the same index. Then, by Lemma 8.2, A and B can call each other's
methods, and we can get cyclic dependencies between Xmodules.6 Second, the program may
perform improper callbacks. Say a method from A calls back to a method m from C. If m is an improper
callback, it can call a method of B, even though the type system knows that A is to the right of B.
In both cases, the type system allows a program with cyclic dependency between Xmodules to pass
the type checks, which is not allowed by Definition 19.
5 Note that C could be the world Xmodule, in which case both oB and oD were created in the main function
using the world keyword.
6 Since all non-Xmodule objects are implicitly assigned higher indices than their Xmodule siblings, these non-
Xmodule objects can not introduce cyclic dependencies between Xmodules.
To have an ownership-aware TM which guarantees exactly Definition 19, one needs to impose
additional dynamic checks. The runtime system can use the ownership tags to build a module tree
during runtime, and use this module tree to perform dynamic checks to verify that every Xmodule
has unique owner indices and contains only proper callbacks. The runtime system can do this
by dynamically inferring indices according to which Xmodule calls which other Xmodule, and
reporting an error if there is any circular calling.7
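One possible shape for the circular-calling check is sketched below in Python. The thesis does not give an implementation; the CallGraph class, its method names, and the exception type are hypothetical, and the sketch assumes record_call is invoked only for calls between distinct sibling Xmodules.

```python
from collections import defaultdict

class CircularCallError(Exception):
    """Raised when recorded cross-Xmodule calls would form a cycle."""

class CallGraph:
    def __init__(self):
        # caller Xmodule -> set of Xmodules it has called so far
        self.callees = defaultdict(set)

    def _reaches(self, src, dst):
        """Depth-first search: can dst be reached from src along recorded calls?"""
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.callees[node])
        return False

    def record_call(self, caller, callee):
        """Record caller -> callee, rejecting the call if it would close a cycle."""
        if self._reaches(callee, caller):
            raise CircularCallError(f"{callee} already reaches {caller}")
        self.callees[caller].add(callee)
```

Rejecting a call as soon as it would close a cycle keeps the recorded graph acyclic, so a consistent left-to-right ordering of the siblings always exists for the calls accepted so far.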
8.4 Computations with Xmodules
In this section, we formally define the structure of transactional programs with Xmodules. This
section converts the informal explanation from Section 8.2 into a formal model that we later use
to prove properties of ownership-aware TM. We build on top of the computation tree framework
described in Chapter 6. We add Xmodules and ownership to this framework, and provide the formal
statement of Definition 19.
As mentioned in Section 8.2, we consider programs that contain Xmodules. In our theoretical
framework, we consider traces generated by a program which is organized into a set 𝒩 of
Xmodules. Each Xmodule A ∈ 𝒩 has some number of methods and a set of memory locations associated
with it.
We partition the set of all memory locations ℳ into sets of memory owned by each Xmodule.
Let modMemory(A) ⊆ ℳ denote the set of memory locations owned by A. For a location ℓ ∈
modMemory(A), we say that owner(ℓ) = A. When a method of Xmodule A is called by a method
from a different Xmodule, a safe-nested transaction T is generated.8 We use the notation M_T = A
to associate the instance T with the Xmodule A. We also define the instances associated with A as
modXactions(A) = {T ∈ xactions(C) : M_T = A}.
As mentioned in Section 8.2, Xmodules of a program are arranged as the module tree, denoted
by D. Each Xmodule is assigned an xid according to a left-to-right depth-first tree walk, with the
root of D being world with xid = 0. Denote the parent of Xmodule A in D as modParent(A),
and the ancestors of A as modAnces(A) (including A itself). Similarly, let modDesc(A) be the set
of A's descendants. We say that M_{root(C)} = world, i.e., the root of the computation tree is a
transaction associated with the world Xmodule.
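The xid assignment can be sketched as follows. This is a minimal Python illustration that assumes the walk numbers each Xmodule when it is first visited (preorder), which is what gives the root xid 0; the children dictionary and its module names are hypothetical.

```python
def assign_xids(children, root="world"):
    """Number Xmodules by a left-to-right depth-first (preorder) walk; the root gets xid 0."""
    xid = {}
    stack = [root]
    while stack:
        module = stack.pop()
        xid[module] = len(xid)
        # Push children right-to-left so that the leftmost child is visited first.
        stack.extend(reversed(children.get(module, [])))
    return xid

# Hypothetical module tree; each list orders the children from left to right.
children = {
    "world": ["UserApp"],
    "UserApp": ["DB", "Logger"],
    "DB": ["BST", "Hashmap"],
}
xids = assign_xids(children)
```

In this example every descendant of DB receives a smaller xid than Logger, DB's right sibling, so xid(A) < xid(B) holds whenever B lies to the right of A or below it.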
We use the module tree D to restrict the sharing of data between Xmodules and to limit the
visibility of Xmodule methods according to the rules given in Definition 20.
Definition 20 (Formal Restatement of Definition 19) A program with a module tree D should
generate only traces (C, Φ) which satisfy the following rules:
1. For any memory operation v which accesses a memory location ℓ, let T = xparent[v]. Then
owner(ℓ) ∈ modAnces(M_T).
2. Let X, Y ∈ xactions(C) be transaction instances such that M_X = A and M_Y = B. We
can have X = xparent[Y] only if modParent(B) ∈ modAnces(A), and xid(A) < xid(B).
7 It is possible to statically check for unique owner indices by imposing additional restrictions on the program. We
opted, however, to describe a more flexible programming model with weaker static guarantees.
8 As we explained in Section 8.2, callbacks are handled differently.
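The two rules of Definition 20 can be written as predicates over a module tree. The Python sketch below is illustrative only; the mod_parent and xid tables describe a hypothetical module tree, and the function names are not from the thesis.

```python
# Hypothetical module tree (child -> parent) with xids from a left-to-right preorder walk.
mod_parent = {"UserApp": "world", "DB": "UserApp", "BST": "DB", "Logger": "UserApp"}
xid = {"world": 0, "UserApp": 1, "DB": 2, "BST": 3, "Logger": 4}

def mod_ances(a):
    """modAnces(A): A together with all of its ancestors up to world."""
    result = {a}
    while a != "world":
        a = mod_parent[a]
        result.add(a)
    return result

def rule1_ok(owner_of_location, module_of_xaction):
    """Rule 1: a transaction T may access l only if owner(l) is in modAnces(M_T)."""
    return owner_of_location in mod_ances(module_of_xaction)

def rule2_ok(a, b):
    """Rule 2: a transaction of A may be the xparent of a transaction of B only if
    modParent(B) is in modAnces(A) and xid(A) < xid(B)."""
    return mod_parent[b] in mod_ances(a) and xid[a] < xid[b]
```

Here DB may call Logger (its right sibling) and BST (its child), Logger may not call DB, and UserApp may not call BST directly because BST's parent is not among UserApp's ancestors.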
8.5 The OAT Model
In this section, we describe the OAT model, an abstract execution model for TM with ownership
and Xmodules. The novel feature of the OAT model is that it uses the structure of Xmodules to
provide a commit mechanism which can be viewed as a hybrid of closed and open-nested commits.
The OAT model presents an operational semantics for TM, and is not intended to describe an actual
implementation, although these semantics can be used to guide an implementation.
Overview
The TM system is modeled as a nondeterministic state machine with two components: a program
and a runtime system. The runtime system, which we call the OAT model, dynamically constructs
and traverses a computation tree C as it executes instructions generated by the program. The OAT
model maintains a set of ready nodes, denoted by ready(C) ⊆ nodes(C), and at every step, the
OAT model nondeterministically chooses one of these ready nodes X ∈ ready(C) to issue the
next instruction. The program then issues one of the following instructions (whose precondition
is satisfied) on X's behalf: fork, join, xbegin, xend, xabort, read, or write. For
shorthand, we sometimes say that X issues an instruction.
The OAT model describes a sequential semantics, that is, we assume at every time step t, a
program issues a single instruction. The parallelism in this model arises from the fact that at a
particular time, several nodes can be ready, and the runtime nondeterministically chooses which
node to issue an instruction.
In the rest of this section, we give a detailed description of the OAT model. First, we describe
the state information maintained by the OAT model and define the notation we use to refer to this
state. Second, we describe how the OAT model constructs and traverses the computation tree as
instructions are issued. Then, we describe how the OAT model handles memory operations (i.e.,
read and write), conflict detection, transaction commits, and transaction aborts.
State information and notation
As the OAT model executes instructions, it dynamically constructs the computation tree C. For
each of the sets defined in Section 8.4 (e.g., nodes(C), spNodes(C), memOps(C), xactions(C),
etc.), we define corresponding time-dependent versions of these sets by indexing them with an
additional time argument. For example, nodes(t, C) denotes the set of nodes
in the computation tree after t time steps have passed. The generalized sets from Section 8.4 are
monotonically increasing, i.e., once an element is added to the set, it is never removed at a later
time t. Sometimes for shorthand, we omit the time argument when it is clear that we are referring
to a particular fixed time t.
At any time t, each internal node A ∈ spNodes(t, C) has a status field status[A]. These status
fields change with time. If A ∈ xactions(t, C), i.e., A is a transaction, then status[A] can be one
of COMMITTED, ABORTED, PENDING, or PENDING_ABORT. Otherwise, A ∈ spNodes(t, C) −
xactions(t, C) is either a P-node or a nontransactional S-node; in this case, status[A] can either
be WORKING or SYNCHED. We define several abstract sets for the tree based on this status field.
The first 6 sets partition spNodes(t, C), the set of internal nodes of the computation tree. The
last 4 sets categorize transactions and nodes as being either active or complete.
1. pending(t, C) = {X ∈ xactions(t, C) : status[X] = PENDING} (Pending transactions).
2. pendingAbort(t, C) = {X ∈ xactions(t, C) : status[X] = PENDING_ABORT} (Aborting
transactions).
3. committed(t, C) = {X ∈ xactions(t, C) : status[X] = COMMITTED} (Committed
transactions).
4. aborted(t, C) = {X ∈ xactions(t, C) : status[X] = ABORTED} (Aborted transactions).
5. working(t, C) = {Z ∈ spNodes(t, C) − xactions(t, C) : status[Z] = WORKING} (Working
nodes).
6. synched(t, C) = {Z ∈ spNodes(t, C) − xactions(t, C) : status[Z] = SYNCHED} (Synched
nodes).
7. activeX(t, C) = pending(t, C) ∪ pendingAbort(t, C) (Active transactions).
8. activeN(t, C) = activeX(t, C) ∪ working(t, C) (Active nodes).
9. doneX(t, C) = committed(t, C) ∪ aborted(t, C) (Complete transactions).
10. doneN(t, C) = doneX(t, C) ∪ synched(t, C) (Complete nodes).
The OAT model maintains a set of ready S-nodes, denoted as ready(t, C). We discuss the
properties of ready nodes later, in Section 8.5. Note that ready(t, C), and the sets defined above which
are subsets of activeN(t, C), are not monotonic, because completing nodes removes elements from
these sets.
For the purposes of detecting conflicts, at any time t, for any active transaction T, i.e., T ∈
activeX(t, C), the OAT model maintains a read set R(t, T) and a write set W(t, T) for T. The read
set R(t, T) is a set of pairs (ℓ, v), where ℓ ∈ ℳ is a memory location and v ∈ memOps(t, C) is a
memory operation that reads from ℓ. We define W(t, T) similarly. We represent main memory as the
read set/write set of root(C). At time t = 0, we assume R(0, root(C)) and W(0, root(C)) initially
contain a pair (ℓ, ⊥) for all locations ℓ ∈ ℳ.
In addition to the basic read and write sets, we also define the module read set and module write
set for all transactions T ∈ activeX(t, C). The module read set is defined as
modR(t, T) = {(ℓ, v) ∈ R(t, T) : owner(ℓ) = M_T}.
In other words, modR(t, T) is the subset of R(t, T) that accesses memory owned by T's Xmodule
M_T. Similarly, we define the module write set as
modW(t, T) = {(ℓ, v) ∈ W(t, T) : owner(ℓ) = M_T}.
The OAT model maintains two invariants on R(t, T) and W(t, T). First, W(t, T) ⊆ R(t, T) for
every transaction T ∈ xactions(t, C), i.e., a write also counts as a read. Second, R(t, T) and
W(t, T) each contain at most one pair (ℓ, v) for any location ℓ. Thus, we use the shorthand ℓ ∈
R(t, T) to mean that there exists a node u such that (ℓ, u) ∈ R(t, T), and similarly for W(t, T). We
also overload the union operator: at some time t, an operation R(T) ← R(T) ∪ {(ℓ, u)} means we
construct R(t + 1, T) by
R(t + 1, T) = {(ℓ, u)} ∪ (R(t, T) − {(ℓ, u') ∈ R(t, T)}).
In other words, we add (ℓ, u) to R(T), replacing any (ℓ, u') ∈ R(t, T) that existed previously.
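A read set with at most one pair per location behaves like a map from locations to operations, and the overloaded union is then a keyed update. The Python sketch below is one way to picture it; representing R(t, T) as a dict and the name add_pair are choices made for this illustration, not part of the OAT model.

```python
def add_pair(read_set, loc, op):
    """Overloaded union: insert the pair (loc, op), replacing any earlier pair for loc.

    read_set stands for R(t, T) as a dict {location: operation};
    the returned dict stands for R(t + 1, T).
    """
    updated = dict(read_set)  # copy, so that R(t, T) itself is left unchanged
    updated[loc] = op
    return updated
```

For example, add_pair({"x": "u1"}, "x", "u2") yields a set holding only the pair for u2 at location x, matching the replacement described above.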
Constructing the computation tree
In the OAT model, the runtime constructs the computation tree in a straightforward fashion as
instructions are issued. For completeness, however, we give a detailed description of this construc-
tion.
Initially, at time t = 0, we begin with only the root node in the tree, i.e., nodes(0, C) =
xactions(0, C) = {root(C)}. This root node also begins as ready, i.e., ready(0, C) = {root(C)}.
Throughout the computation, the status of the root node of the tree is always PENDING.
A new internal node is created if the OAT model picks ready node X and X issues a fork or
xbegin instruction. If X issues a fork, then the runtime creates a P-node P as a child of X,
and two S-nodes S1 and S2 as children of P, all with status WORKING. The fork also removes X
from ready(C) and adds S1 and S2 to ready(C). If X issues an xbegin, then the runtime creates
a new transaction Y ∈ xactions(C) as a child of X, with status[Y] = PENDING, removes X
from ready(C), and adds Y to ready(C).
The OAT model completes a nontransactional S-node Z ∈ ready(t, C) − xactions(t, C)
(which must have status[Z] = WORKING) by having Z issue a join instruction. The join
instruction first changes status[Z] to SYNCHED. In the tree, since parent[Z] is always a P-node,
Z has exactly one sibling. If Z is the first child of parent[Z] to be SYNCHED, the OAT model
removes Z from ready(C). Otherwise, Z is the last child of parent[Z] to be SYNCHED, and the
OAT model removes Z and parent[Z] from ready(C), changes the status of both Z and parent[Z]
to SYNCHED, and adds parent[parent[Z]] to ready(C).
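The fork and join bookkeeping can be sketched as a small Python class. This is an illustration under simplifying assumptions, not the OAT model itself: it covers only fork and join (no transactions or memory operations), and the class name, node naming scheme, and field names are hypothetical.

```python
import itertools

class ComputationTree:
    """Tracks parent links, status fields, and the ready set for fork and join only."""

    def __init__(self):
        self.parent = {"root": None}
        self.children = {"root": []}
        self.status = {"root": "PENDING"}   # the root is a transaction, always PENDING
        self.ready = {"root"}
        self._ids = itertools.count(1)

    def _add(self, kind, parent):
        """Create a new WORKING node of the given kind ("P" or "S") under parent."""
        node = f"{kind}{next(self._ids)}"
        self.parent[node] = parent
        self.children[node] = []
        self.children[parent].append(node)
        self.status[node] = "WORKING"
        return node

    def fork(self, x):
        """X issues fork: create P-node P under X and S-nodes S1, S2 under P."""
        assert x in self.ready
        p = self._add("P", x)
        s1, s2 = self._add("S", p), self._add("S", p)
        self.ready.remove(x)
        self.ready.update({s1, s2})
        return s1, s2

    def join(self, z):
        """Z issues join: mark Z SYNCHED; the last sibling to join also completes P."""
        assert z in self.ready and self.status[z] == "WORKING"
        self.status[z] = "SYNCHED"
        self.ready.remove(z)
        p = self.parent[z]
        if all(self.status[c] == "SYNCHED" for c in self.children[p]):
            self.status[p] = "SYNCHED"
            self.ready.add(self.parent[p])
```

After a fork from the root followed by a join from each S-node, the ready set again contains only the root, which mirrors the invariant that ready nodes sit at the leaves of the tree of active nodes.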
The OAT model can complete a transaction X ∈ ready(t, C) by having it issue either an
xend or xabort instruction. If status[X] = PENDING, then X can issue an xend to change
status[X] to COMMITTED. Otherwise, status[X] = PENDING_ABORT, and X can issue an
xabort to change its status to ABORTED. For both xend and xabort, the OAT model removes
X from ready(C) and adds parent[X] back into ready(C). The xend instruction also performs
an ownership-aware commit and changes read sets and write sets, which we describe later in
Section 8.5.
Finally, a ready node X can issue a read or write instruction. If the instruction does not
generate a conflict, it adds a memory operation node v to memOps(t, C), with v as a child of X.
If the instruction would create a conflict, the runtime may change the status of one PENDING
transaction T to PENDING_ABORT to make progress in resolving the conflict. For shorthand, we
refer to the status change of a transaction T from PENDING to PENDING_ABORT as a sigabort
of T.
This construction of the tree guarantees a few properties.
First, the sequence of instructions S generated by the OAT model is a valid topological sort of
the computation dag G(C). Second, the OAT model generates a tree of a canonical form, where the
root node of the tree is a transaction, all transactions are S-nodes, and every P-node has exactly two
nontransactional S-node children. This canonical form is imposed for convenience of description;
it is not important for any theoretical results. Finally, the OAT model maintains the invariant that the
active nodes form a tree, with the ready nodes at the leaves. This property is important for the
correctness of the OAT model.
Memory operations and conflict detection
The OAT model performs eager conflict detection; before performing a memory operation that
would create a new v ∈ memOps(C), the OAT model first checks whether creating v would cause a
conflict, according to Definition 21.
Definition 21 Suppose at time t, the OAT model issues a read or write instruction that
potentially creates a memory operation node v. We say that v generates a memory conflict if there exists
a location ℓ ∈ ℳ and an active transaction T_u ∈ activeX(t, C) such that
1. T_u ∉ xAnces(v), and
2. either R(v, ℓ) ∧ ((ℓ, u) ∈ W(t, T_u)), or W(v, ℓ) ∧ ((ℓ, u) ∈ R(t, T_u)).
If a potential memory operation v would generate a conflict, then the memory operation v does
not occur; instead, a sigabort of some transaction may occur. We describe the mechanism for
aborts in Section 8.5. Otherwise, v does not generate a conflict. A successful read v observes the
value of ℓ from R(Y), where Y is the closest ancestor of v with ℓ in its read set (i.e., (ℓ, u) ∈ R(Y)
and Φ(v) = u). The read also adds v to X's read set. A successful write operation v sets the
observer function Φ(v) in the same way as a read. The write adds (ℓ, v) to both R(X) and W(X).
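The conflict test of Definition 21 can be sketched in Python as follows. The function name, its parameters, and the representation of read and write sets as plain sets of locations are assumptions made for this illustration; the OAT model itself stores pairs of locations and operations.

```python
def generates_conflict(kind, loc, x_ances, active):
    """Definition 21 for a prospective memory operation v.

    kind    -- "read" or "write"
    loc     -- the location v would access
    x_ances -- names of the transactions in xAnces(v)
    active  -- dict mapping each active transaction to (read locations, write locations)
    """
    for name, (reads, writes) in active.items():
        if name in x_ances:
            continue                  # condition 1: ancestors of v never conflict with v
        if kind == "read" and loc in writes:
            return True               # v reads a location that another transaction wrote
        if kind == "write" and loc in reads:
            return True               # v writes a location that another transaction read
    return False
```

Because every write also counts as a read (W(t, T) ⊆ R(t, T)), testing a write against the read sets alone also covers write-write conflicts.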
Ownership-aware transaction commit
The ownership-aware commit mechanism employed by the OAT model contains elements of both
closed-nested and open-nested commits. A PENDING transaction Y issues an xend instruction to
commit Y into X = xparent[Y]. This xend commits locations from its read and write sets which
are owned by M_Y in an open-nested fashion to the root of the tree, while it commits locations
owned by other Xmodules in a closed-nested fashion, merging those reads and writes into X's read
and write sets.
We can describe the OAT model's commit mechanism more formally in terms of module read-
sets and writesets. Suppose at time t, Y E xactions(t, C) with status[Y] = PENDING issues an
xend. This xend changes readsets and writesets as follows.
R(root(C)) ← R(root(C)) ∪ modR(Y)
R(xparent[Y]) ← R(xparent[Y]) ∪ (R(Y) − modR(Y))
W(root(C)) ← W(root(C)) ∪ modW(Y)
W(xparent[Y]) ← W(xparent[Y]) ∪ (W(Y) − modW(Y))
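The four updates performed by an xend can be sketched in Python. This is an illustration under stated assumptions rather than the thesis's implementation: read and write sets are modeled as dicts from location to operation, and the function name and the owner and module_of tables are hypothetical.

```python
def oat_commit(y, parent, root, reads, writes, owner, module_of):
    """Apply the set updates of an xend issued by transaction y.

    reads, writes -- dicts: transaction -> {location: operation}
    owner         -- dict: location -> owning Xmodule
    module_of     -- dict: transaction -> its Xmodule (M_T)
    """
    for sets in (reads, writes):
        for loc, op in sets[y].items():
            # Locations owned by y's own Xmodule go to the root (open-nested style);
            # all other locations are merged into the parent (closed-nested style).
            target = root if owner[loc] == module_of[y] else parent
            sets[target][loc] = op
```

For a transaction Y of Xmodule B nested in X, a location owned by B ends up in the root's sets, while a location owned by an ancestor Xmodule stays visible only through X's sets until X itself commits.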
Unique committer property
Definition 20 guarantees certain properties of the computation tree which are essential to the
ownership-aware commit mechanism. Theorem 8.5 proves that every memory operation has one
and only one transaction that is responsible for committing the memory operation. The proof of the
theorem requires the following lemma.
Lemma 8.4 Given a computation tree $C$, for any $T \in \mathrm{xactions}(C)$, let $S_T = \{M_{T'} : T' \in \mathrm{xAnces}(T)\}$.
Then $\mathrm{modAnces}(M_T) \subseteq S_T$.
PROOF. We prove this fact by induction on the nesting depth of transactions $T$ in the computation
tree. In the base case, the top-level transaction is $T = \mathrm{root}(C)$, and $M_{\mathrm{root}(C)} = \mathrm{world}$. Thus,
the fact holds trivially. For the inductive step, assume that $\mathrm{modAnces}(M_T) \subseteq S_T$ holds for any
transaction $T$ at depth $d$. We show that the fact holds for any $T^* \in \mathrm{xactions}(C)$ at depth $d + 1$.
For any such $T^*$, we know $T = \mathrm{xparent}[T^*]$ is at depth $d$. Then, by Rule 2 of Definition 20,
we have $\mathrm{modParent}(M_{T^*}) \in \mathrm{modAnces}(M_T)$. Thus, $\mathrm{modAnces}(M_{T^*}) \subseteq \mathrm{modAnces}(M_T) \cup \{M_{T^*}\}$.
By construction of the set $S_T$, we have $S_{T^*} = S_T \cup \{M_{T^*}\}$. Therefore, using the
inductive hypothesis, we have $\mathrm{modAnces}(M_{T^*}) \subseteq S_{T^*}$. $\square$
Theorem 8.5 If a memory operation $v$ accesses a memory location $\ell$, then there exists a unique
transaction $T^* \in \mathrm{xAnces}(v)$, such that
1. $\mathrm{owner}(\ell) = M_{T^*}$, and
2. For all transactions $X \in \mathrm{pAnces}(T^*) \cap \mathrm{xactions}(C)$, $X$ can not directly access location $\ell$.
This transaction T* is the committer of memory operation v, denoted committer(v).
PROOF. This result follows from the properties of the module tree and computation tree stated in
Definition 20.
Let $T = \mathrm{xparent}[v]$. First, by Definition 20, Rule 1, we know $\mathrm{owner}(\ell) \in \mathrm{modAnces}(M_T)$.
We know $\mathrm{modAnces}(M_T) \subseteq S_T$ by Lemma 8.4. Thus, there exists some transaction
$T^* \in \mathrm{xAnces}(T)$ such that $\mathrm{owner}(\ell) = M_{T^*}$. We can use Rule 2 to show that $T^*$ is unique. Let
$X_i$ be the chain of ancestor transactions of $T$, i.e., let $X_0 = T$, and let $X_i = \mathrm{xparent}[X_{i-1}]$, up
until $X_k = \mathrm{root}(C)$. By Rule 2, we know $\mathrm{xid}(M_{X_i}) < \mathrm{xid}(M_{X_{i-1}})$, that is, the xids strictly
decrease walking up the tree from $T$. Thus, there can be only one ancestor transaction $T^*$ of $T$ with
$\mathrm{xid}(M_{T^*}) = \mathrm{xid}(\mathrm{owner}(\ell))$.
To check the second condition of Theorem 8.5, consider any $X \in \mathrm{pAnces}(T^*) \cap \mathrm{xactions}(C)$.
By Rule 1, $X$ can access $\ell$ directly only if $\mathrm{owner}(\ell) \in \mathrm{modAnces}(M_X)$, implying that
$\mathrm{xid}(\mathrm{owner}(\ell)) \le \mathrm{xid}(M_X)$. But we know that $\mathrm{owner}(\ell) = M_{T^*}$ and $\mathrm{xid}(M_{T^*}) > \mathrm{xid}(M_X)$. $\square$
Intuitively, $T^* = \mathrm{committer}(v)$ is the transaction which "belongs" to the same Xmodule as
the location $\ell$ which $v$ accesses, and is "responsible" for committing $v$ to memory and making it
visible to the world. The second condition of Theorem 8.5 states that no ancestor transaction of $T^*$
in the call stack can ever directly access $\ell$; thus, it is "safe" for $T^*$ to commit $\ell$.
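A minimal sketch of how the committer of Theorem 8.5 could be located: walk up the chain of enclosing transactions and pick the one whose Xmodule owns the accessed location. The dictionaries `xparent`, `module_of`, and `owner`, and the function name, are hypothetical; the uniqueness check simply asserts what the theorem guarantees for well-formed trees.

```python
def committer(start, loc, owner, module_of, xparent):
    """Return the unique transactional ancestor of an operation whose module owns loc.

    start is xparent[v] for the memory operation v; xparent maps each
    transaction to its parent transaction (None above the root).
    """
    matches = []
    X = start
    while X is not None:
        if module_of[X] == owner[loc]:
            matches.append(X)
        X = xparent[X]
    if len(matches) != 1:
        # Cannot happen if the module and computation trees obey Definition 20.
        raise ValueError("expected exactly one committer, found %d" % len(matches))
    return matches[0]
```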
Transaction abort
When the OAT model detects a conflict, it aborts one of the conflicting transactions by changing its
status from PENDING to PENDINGABORT. In the OAT model, a transaction X might not abort
immediately; instead, it might continue to issue more instructions after its status has changed to
PENDINGABORT. Later, it will be useful to refer to the set of operations a transaction X issues
while its status is PENDINGABORT.
Definition 22 The set of operations issued by $X$ or descendants of $X$ after $\mathrm{status}[X]$ changes to
PENDINGABORT are called $X$'s abort actions. This set is denoted by $\mathrm{abortactions}(X)$.
The PENDINGABORT status allows $X$ to compensate for the safe-nested transactions that
may have committed; if transaction $Y$ is nested inside $X$, then the abort actions of $X$ contain the
compensating action of $Y$. Eventually a PENDINGABORT transaction issues an xend instruction,
which changes its status from PENDINGABORT to ABORTED.
If a potential memory operation $v$ generates a conflict with $T_u$ and $T_u$'s status is PENDING,
then the OAT model can nondeterministically choose to abort either $\mathrm{xparent}[v]$ or $T_u$. In the
latter case, $v$ waits for $T_u$ to finish aborting (i.e., change its status to ABORTED) before continuing.
If $T_u$'s status is PENDINGABORT, then $v$ just waits for $T_u$ to finish aborting before trying to issue
read or write again.
This operational model uses the same conflict detection algorithm as TM with ordinary closed-
nested transactions does; the only subtleties are that $v$ can generate a conflict with a PENDINGABORT
transaction $T_u$, and that transactions no longer abort instantaneously because they have abort
actions. Some restrictions on the abort actions of a transaction may be necessary to avoid deadlock,
as we describe later in Section 8.7.
8.6 Serializability by Modules
In this section, we define serializability by modules, a definition inspired by the database notion of
multilevel serializability (e.g., as described in [Wei86]). First, we describe the definition of serial-
izability in the transactional computation framework, as given in [ALS06]. Next, we incorporate
Xmodules into this definition and define serializability by modules. We then prove that the OAT
model guarantees serializability by modules. Finally, we discuss the relationship between the def-
inition of serializability by modules, and the notion of abstract serializability for the methodology
of open nesting.
Transactional computations and serializability
In Chapter 6, we defined serializability formally for TM systems with closed and open nesting.
In this section, we extend that definition to ownership-aware transactions. In order to do so, we
first define content sets more generally than in Chapter 6, and extend the definition of hidden
vertices. Once we have done so, the definitions of serializability and transactional serializability
automatically generalize. Informally, a trace $(C, \Phi)$ is serializable if there exists a topological sort
order $S$ of $G(C)$ such that $S$ is "sequentially consistent with respect to $\Phi$", and all transactions
appear contiguous in the order $S$.
Content sets
We first describe some notation needed to formally describe serializability by modules. All
definitions in this section are a posteriori, i.e., they are defined on the computation tree after the program
has finished executing.
We define "content" sets for every transaction $T$ by partitioning $\mathrm{memOps}(T)$ (all the memory
operations enclosed inside $T$, including those belonging to its nested transactions) into three sets:
$\mathrm{cContent}(T)$, $\mathrm{oContent}(T)$ and $\mathrm{aContent}(T)$. For any $u \in \mathrm{memOps}(T)$, we define the content
sets based on the final status of transactions in $C$ that one visits when walking up the tree from $u$ to
$T$.
Definition 23 For any transaction $T$ and memory operation $u$, define the sets $\mathrm{cContent}(T)$, $\mathrm{oContent}(T)$,
and $\mathrm{aContent}(T)$ according to the ContentType($u$, $T$) procedure:
ContentType(u, T)    ▷ For any u ∈ memOps(T)
    X ← xparent[u]
    while (X ≠ T)
        if (X is ABORTED) return u ∈ aContent(T)
        if (X = committer(u)) return u ∈ oContent(T)
        X ← xparent[X]
    return u ∈ cContent(T)
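The ContentType procedure translates almost directly into Python. In this sketch the tree is encoded with hypothetical dictionaries (`xparent` for the parent transaction, `status` for the final status), and the committer of the operation is passed in rather than computed; these encodings are assumptions made for illustration.

```python
def content_type(u_parent, T, xparent, status, committer_u):
    """Classify a memory operation u in memOps(T) as aContent, oContent or cContent.

    u_parent is xparent[u]; committer_u is committer(u).
    """
    X = u_parent
    while X != T:
        if status[X] == "ABORTED":
            return "aContent"   # some enclosing subtransaction aborted
        if X == committer_u:
            return "oContent"   # committed open-nested below T
        X = xparent[X]
    return "cContent"           # neither aborted nor open-committed below T
```

Note that the walk checks transactions strictly below $T$, so the status of $T$ itself never affects the classification.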
Recall that in the OAT model, the safe-nested commit of $T$ commits some memory operations
in an open-nested fashion, to $\mathrm{root}(C)$, and some operations in a closed-nested fashion, to
xparent[T]. Informally, oContent(T) is the set of memory operations that are committed in an
"open" manner by T's subtransactions. Similarly, aContent(T) is the set of operations that are
discarded due to the abort of some subtransaction in T's subtree. Finally, cContent(T) is the set
of operations that are neither committed in an "open" manner, nor aborted.
Sequential consistency with transactions
For computations with transactions, we can modify the classic notion of sequential consistency
to account for transactions which abort. Transactional semantics dictate that memory operations
belonging to an aborted transaction T should not be observed by (i.e., are hidden from) memory
operations outside of T.
Definition 24 For $u \in \mathrm{memOps}(C)$, $v \in V(C)$, let $X = \mathrm{xLCA}(u, v)$. We say that $u$ is hidden from $v$
if $u \in \mathrm{aContent}(X)$.
Our definition of serializability by modules requires that computations satisfy some notion of
sequential consistency, generalized for the setting of TM. Here is the definition of transactional last
writer function from Chapter 6.
Definition 25 Consider a trace $(C, \Phi)$ and a topological sort $S$ of $G(C)$. For all $v \in \mathrm{memOps}(C)$
such that $R(v, \ell) \vee W(v, \ell)$, the transactional last writer of $v$ according to $S$, denoted $X_S(v)$, is
the unique $u \in \mathrm{memOps}(C) \cup \{\bot\}$ that satisfies four conditions:
1. $W(u, \ell)$,
2. $u <_S v$,
3. $\neg(uHv)$, and
4. $\forall w\, (W(w, \ell) \wedge (u <_S w <_S v)) \implies wHv$.
Definition 26 A trace $(C, \Phi)$ is sequentially consistent if there exists a topological sort $S$ such that
$\Phi = X_S$. We say that $S$ is sequentially consistent with respect to $\Phi$.
In other words, the transactional last writer of a memory operation $u$ which accesses location
$\ell$ is the last write $v$ to location $\ell$ in the order $S$, except we skip over writes $w$ which are hidden
from (i.e., aborted with respect to) $u$. Intuitively, Definition 12 requires that there exists an order $S$
explaining all the memory operations of the computation.
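A small sketch of the transactional last writer: scan backwards through the order $S$ from the operation in question and return the first write to the same location that is not hidden from it. The tuple encoding of operations and the `hidden` callback are assumptions made for illustration; `None` plays the role of $\bot$.

```python
def transactional_last_writer(order, v_index, hidden):
    """Return the transactional last writer of order[v_index], or None.

    order is a list of (name, kind, location) tuples in the order S, where
    kind is "R" or "W"; hidden(u, v) returns True when u is hidden from v.
    """
    v = order[v_index]
    loc = v[2]
    for i in range(v_index - 1, -1, -1):
        u = order[i]
        if u[1] == "W" and u[2] == loc and not hidden(u, v):
            return u  # every later write to loc was skipped because it is hidden from v
    return None
```

Returning the first non-hidden write found by the backward scan satisfies all four conditions of the definition, since every write skipped along the way is hidden from the operation.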
Serializability
Definition 27 A trace $(C, \Phi)$ is serializable if there exists a topological sort $S$ that satisfies two
conditions:
1. $\Phi = X_S$ ($S$ is sequentially consistent with respect to $\Phi$), and
2. $\forall T \in \mathrm{xactions}(C)$ and $\forall v \in V(C)$, we have $\mathrm{xbegin}(T) <_S v <_S \mathrm{xend}(T)$ implies
$v \in V(T)$.
Ordinary serializability can be thought of as a strengthening of sequential consistency which
requires that the order $S$ both explains all memory operations and has all transactions appearing
contiguous.
Defining serializability by modules
In Chapter 6, a trace $(C, \Phi)$ was said to be serializable if there exists a topological sort $S$ of $G(C)$
such that $S$ is sequentially consistent with respect to $\Phi$, and all transactions appear contiguous in $S$.
Serializability in this context can be thought of as sequential consistency plus the requirement that
transactions are atomic. This definition of serializability is the "correct definition" for flat or closed-
nested transactions. This definition of serializability is too strong, however, for ownership-aware
transactions. A TM system that enforces this definition of serializability can not ignore lower-level
memory accesses when detecting conflicts for higher-level transactions.
Instead, we describe a definition of serializability by modules which checks for correctness of
one Xmodule at a time. Given a trace $(C, \Phi)$, for each Xmodule $A$, we transform the tree $C$ into
a new tree $\mathrm{mTree}(C, A)$. The tree $\mathrm{mTree}(C, A)$ is constructed in such a way as to ignore memory
operations of Xmodules which are lower-level than $A$, and also to ignore all operations which are
hidden from transactions of $A$. For each Xmodule $A$, we check that the transactions of $A$ in the
trace $(\mathrm{mTree}(C, A), \Phi)$ are serializable. If the check holds for all Xmodules, then trace $(C, \Phi)$ is said
to be serializable by modules.
Definition 28 formalizes the construction of mTree(C, A).
Definition 28 For any computation tree $C$, let $\mathrm{mTree}(C, A)$ be the result of modifying $C$ as follows:
1. For all memory operations $u \in \mathrm{memOps}(C)$ with $u$ accessing $\ell$, if $\mathrm{owner}(\ell) = B$ for some
$\mathrm{xid}(B) > \mathrm{xid}(A)$, convert $u$ into a nop.
2. For all transactions $T \in \mathrm{modXactions}(A)$, convert all $u \in \mathrm{aContent}(T)$ into nops.
The intuition behind Condition 1 of Definition 28 is the following. When looking at Xmodule A,
we throw away memory operations belonging to a lower-level Xmodule B, since by Theorem 8.5,
transactions of A can never directly access the same memory as those operations anyway. In Con-
dition 2, we ignore the content of any aborted transactions nested inside transactions of A; those
transactions might access the same memory locations as operations which we did not turn into
nops, but those operations are aborted with respect to transactions of A.
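The two steps of Definition 28 can be sketched as a filter over the memory operations of the tree. The flat dictionary encoding below (operation name to location, location to the xid of its owner, and a precomputed aContent set per transaction of $A$) is a hypothetical simplification; the definition itself operates on the computation tree.

```python
def m_tree_ops(ops, owner_xid, xid_A, a_content_of_A):
    """Return a copy of ops in which operations ignored for Xmodule A become "nop".

    ops maps operation name -> accessed location; owner_xid maps location ->
    xid of the owning Xmodule; a_content_of_A maps each transaction T in
    modXactions(A) to the set of operation names in aContent(T).
    """
    result = {}
    for name, loc in ops.items():
        lower_level = owner_xid[loc] > xid_A                                  # step 1
        aborted = any(name in ops_in_T for ops_in_T in a_content_of_A.values())  # step 2
        result[name] = "nop" if (lower_level or aborted) else loc
    return result
```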
Lemma 8.6 argues that if a trace $(C, \Phi)$ is sequentially consistent, then $(\mathrm{mTree}(C, A), \Phi)$ is
a valid trace; an operation $u$ that remains in the trace never attempts to observe a value from a
$\Phi(u)$ which was turned into a nop due to Definition 28. In addition, the transformed trace is also
sequentially consistent.
Lemma 8.6 Let $(C, \Phi)$ be any sequentially consistent trace. Then for any Xmodule $A$, $(\mathrm{mTree}(C, A), \Phi)$
is a valid trace. In other words, if $u \in \mathrm{memOps}(\mathrm{mTree}(C, A))$, then $\Phi(u) \in \mathrm{memOps}(\mathrm{mTree}(C, A))$.
Furthermore, any $S$ which is sequentially consistent with respect to $\Phi$ in $(C, \Phi)$ is also sequentially
consistent with respect to $\Phi$ in $(\mathrm{mTree}(C, A), \Phi)$.
PROOF. In the new tree $\mathrm{mTree}(C, A)$, pick any $u \in \mathrm{memOps}(\mathrm{mTree}(C, A))$ which remains.
Assume for contradiction that $v = \Phi(u)$ was turned into a nop in one of Steps 1 and 2.
If $v$ was turned into a nop in Step 1 of Definition 28, then we know that $v$ accessed a
memory location $\ell$ where $\mathrm{xid}(\mathrm{owner}(\ell)) > \mathrm{xid}(A)$. Since $u$ must access the same location $\ell$, $u$
must also be converted into a nop.
If $v$ was turned into a nop in Step 2 of Definition 28, then $v \in \mathrm{aContent}(T)$ for some $T$ with $M_T = A$.
Then we can show that either $vHu$, or $u$ should have also been turned into a nop. Let $X =
\mathrm{xLCA}(v, u)$. Since $X$ and $T$ are both ancestors of $v$, either $X$ is an ancestor of $T$ or $T$ is a proper
ancestor of $X$.
1. First, suppose $T$ is a proper ancestor of $X$. Consider the path of transactions $Y_0, Y_1, \ldots, Y_k$,
where $Y_0 = \mathrm{xparent}[v]$, $\mathrm{xparent}[Y_j] = Y_{j+1}$, and $\mathrm{xparent}[Y_k] = T$. Since $v \in \mathrm{aContent}(T)$,
some $Y_j$ with $0 \le j \le k$ must have $\mathrm{status}[Y_j] = \mathrm{ABORTED}$. Since $T$ is a proper ancestor
of $X$, $X = Y_x$ for some $x$ satisfying $0 \le x \le k$.
(a) If $\mathrm{status}[Y_j] = \mathrm{ABORTED}$ for any $j$ satisfying $0 \le j < x$, then we know $v \in
\mathrm{aContent}(X)$, and thus $vHu$. Since we assumed $(C, \Phi)$ is sequentially consistent and
$\Phi(u) = v$, by Definition 25, we know $\neg(vHu)$, leading to a contradiction.
(b) If $Y_j$ is ABORTED for any $j$ satisfying $x \le j \le k$, then $\mathrm{status}[Y_j] = \mathrm{ABORTED}$
implies that $u \in \mathrm{aContent}(X)$, and thus, $u$ should have been turned into a nop,
contradicting the original setup of the statement.
2. Next, consider the case where $X$ is an ancestor of $T$. Since $v \in \mathrm{aContent}(T)$, we have
$v \in \mathrm{aContent}(X)$. Therefore, this case is analogous to Case 1a above.
Finally, if $\Phi$ is the transactional last writer according to $S$ for $(C, \Phi)$, it is still the transactional
last writer for $(\mathrm{mTree}(C, A), \Phi)$ because the memory operations which are not turned into nops
remain in the same relative order. Thus, the last condition is satisfied. $\square$
Note that Lemma 8.6 depends on the restrictions on Xmodules described in Definition 20.
Without this structure of modules and ownership, the construction of Definition 28 is not guaranteed
to generate a valid trace.
Finally, we can define serializability by modules.
Definition 29 A trace $(C, \Phi)$ is serializable by modules if it is sequentially consistent, and if for all
Xmodules $A$ in $D$, there exists a topological sort $S$ of $C_A = \mathrm{mTree}(C, A)$ such that:
1. $S$ is sequentially consistent with respect to $\Phi$, and
2. For the tree $C_A$, $\forall T \in \mathrm{modXactions}(A)$ and $\forall v \in V(C_A)$, if we have $\mathrm{xbegin}(T) <_S v <_S
\mathrm{xend}(T)$, then $v \in V(T)$.
Informally, a trace $(C, \Phi)$ is serializable by modules if it is sequentially consistent, and if for every
Xmodule $A$, there exists a sequentially consistent order $S$ for the trace $(\mathrm{mTree}(C, A), \Phi)$ such that
all transactions of $A$ are contiguous in $S$.
OAT model guarantees serializability by modules
In this section, we show that the OAT model described in Section 8.5 generates traces $(C, \Phi)$ that are
serializable by modules, i.e., that satisfy Definition 29. The proof of this fact consists of three steps.
First, we generalize the notion of "prefix race-freedom" described in [ALS06], to computations
with Xmodules. Second, we prove that the OAT model guarantees that a program execution is
prefix race-free. Finally, we argue that any trace which is prefix race-free is also serializable by
modules.
Defining prefix race-freedom
First, we define prefix races. These definitions are essentially the same as those in [ALS06], ex-
cept adapted for a system with an ownership-aware commit mechanism instead of an open-nested
commit mechanism.
Definition 30 For any execution order $S$, for any transaction $T \in \mathrm{xactions}(C)$, consider any
$v \notin \mathrm{memOps}(T)$ such that $\mathrm{xbegin}(T) <_S v <_S \mathrm{xend}(T)$. We say there exists a prefix race
between $T$ and $v$ if there exists a memory operation $w \in \mathrm{cContent}(T)$ such that $w <_S v$, $\neg(vHw)$, $v$
and $w$ both access $\ell$, and one of $v$, $w$ writes to $\ell$.
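The prefix-race condition of Definition 30 can be checked mechanically on a given order. The sketch below assumes a flat encoding: `order` is a list of node names, `access` maps each memory operation to a (kind, location) pair, and the sets cContent(T) and memOps(T) as well as the `hidden` relation are supplied by the caller. All of these names are hypothetical.

```python
def has_prefix_race(order, access, begin, end, v, c_content_T, mem_ops_T, hidden):
    """Return True if memory operation v has a prefix race with transaction T.

    begin and end are the names of xbegin(T) and xend(T) in order;
    hidden(a, b) returns True when a is hidden from b.
    """
    pos = {node: i for i, node in enumerate(order)}
    # v must lie outside T but inside T's interval in the order S.
    if v in mem_ops_T or not (pos[begin] < pos[v] < pos[end]):
        return False
    v_kind, v_loc = access[v]
    for w in c_content_T:
        w_kind, w_loc = access[w]
        if (pos[w] < pos[v] and not hidden(v, w)
                and w_loc == v_loc and "W" in (v_kind, w_kind)):
            return True
    return False
```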
Definition 31 A trace $(C, \Phi)$ is prefix race-free if there exists a topological sort $S$ of $G(C)$ satisfying
two conditions:
1. $\Phi = X_S$ ($S$ is sequentially consistent with respect to $\Phi$), and
2. $\forall v \in V(C)$ and $\forall T \in \mathrm{xactions}(C)$ there is no prefix race between $v$ and $T$.
$S$ is called a prefix race-free sort of the trace.
Properties of the OAT model
Second, we prove several invariants that the OAT model preserves, and then use these invariants to
prove that the OAT model generates only traces $(C, \Phi)$ which are prefix race-free.
The sequence of instructions that the OAT model issues naturally generates a topological sort $S$
of the computation dag $G(C)$: the fork and xbegin instructions correspond to the begin nodes
of parallel or series blocks in the dag, the join, xend, and xabort instructions correspond to
end nodes of parallel or series blocks, and the read or write instructions correspond to memory
operation nodes $v \in \mathrm{memOps}(C)$.
Theorem 8.7 Suppose the OAT model generates a trace $(C, \Phi)$ and an execution order $S$. Then,
$\Phi = X_S$, i.e., $S$ is sequentially consistent with respect to $\Phi$.
PROOF. This result is reasonably intuitive, but the proof is tedious and somewhat complicated.
We defer the details of this proof to Appendix .2. $\square$
Next, we describe an invariant on readsets and writesets that the OAT model maintains.
Informally, Lemma 8.8 states that, if a memory operation $u$ that reads (writes) location $\ell$ is in
$\mathrm{cContent}(T)$ for some transaction $T$, then $\ell$ belongs to the read set (write set) of some active
transaction under $T$'s subtree between the time when the memory operation is performed and the
time when $T$ ends.
Lemma 8.8 Suppose the OAT model generates a trace $(C, \Phi)$ with an execution order $S$. For any
transaction $T$, consider a memory operation $u \in \mathrm{cContent}(T)$ which accesses memory location $\ell$
at step $t_0$. Let $t_f$ be the step when $\mathrm{xend}(T)$ or $\mathrm{xabort}(T)$ happens. At any time $t$ such that $t_0 < t < t_f$
there exists some $T' \in \mathrm{xDesc}(T) \cap \mathrm{activeX}^{(t)}(C)$ (i.e., $T'$ is an active transactional descendant of
$T$) such that
1. If $R(u, \ell)$, then $\ell \in R(t, T')$.
2. If $W(u, \ell)$, then $\ell \in W(t, T')$.
PROOF. Let $X_1, X_2, \ldots, X_k$ be the chain of transactions from $\mathrm{xparent}[u]$ up to, but not including
$T$, i.e., $X_1 = \mathrm{xparent}[u]$, $X_j = \mathrm{xparent}[X_{j-1}]$, and $\mathrm{xparent}[X_k] = T$. Since we assume that
$u \in \mathrm{cContent}(T)$ and since $T$ completes at time $t_f$, for every $j$ such that $1 \le j \le k$, there exists
a unique time $t_j$ (satisfying $t_0 < t_j < t_f$) when an xend changes $\mathrm{status}[X_j]$ from PENDING to
COMMITTED; otherwise, we would have $u \in \mathrm{aContent}(T)$.
Also, by Theorem 8.5 and Definition 23, we know $\mathrm{committer}(u) \in \mathrm{xAnces}(T)$, i.e., none of
the $X_j$'s will commit location $\ell$ in an open-nested fashion to the world; otherwise, we would have
$u \in \mathrm{oContent}(T)$.
First, suppose $R(u, \ell)$. At time $t_0$, when the memory operation $u$ completes, $(\ell, u)$ is added to
$R(X_1)$. In general, at time $t_j$, the ownership-aware commit mechanism, as described in Section 8.5,
will propagate $\ell$ from $R(X_j)$ to $R(X_{j+1})$. Therefore, for any time $t$ in the interval $[t_{j-1}, t_j)$, we
know $\ell \in R(t, X_j)$, i.e., for Lemma 8.8, $T' = X_j$. Similarly, for any time $t$ in the interval $[t_k, t_f)$,
we have $\ell \in R(t, T)$, i.e., we choose $T' = T$.
The case where $W(u, \ell)$ is completely analogous to the case of $R(u, \ell)$, except we have both
$\ell \in R(t, T')$ and $\ell \in W(t, T')$. $\square$
We use Theorem 8.7 and Lemma 8.8 to prove that the OAT model generates traces which are
prefix race-free.
Theorem 8.9 Suppose the OAT model generates a trace $(C, \Phi)$ with an execution order $S$. Then $S$
is a prefix race-free sort of $(C, \Phi)$.
PROOF.
For the first condition of Definition 31, we know by Theorem 8.7 that the OAT model generates
an order $S$ which is sequentially consistent with respect to $\Phi$.
To check the second condition, assume for contradiction that we have an order $S$ generated by
the OAT model, but there exists a prefix race between a transaction $T$ and a memory operation
$v \notin \mathrm{memOps}(T)$. Let $w$ be the memory operation from Definition 30, i.e., $w \in \mathrm{cContent}(T)$,
$w <_S v <_S \mathrm{xend}(T)$, $\neg(vHw)$, $w$ and $v$ access the same location $\ell$, with one of the accesses being
a write. Let $t_w$ and $t_v$ be the time steps in which operations $w$ and $v$ occurred, respectively, and let
$t_{endT}$ be the time at which either $\mathrm{xend}(T)$ or $\mathrm{xabort}(T)$ occurs (i.e., either $T$ commits or aborts).
We argue that at time $t_v$, the memory operation $v$ should not have succeeded because it generated a
conflict.
There are three cases for $v$ and $w$. First suppose $W(v, \ell)$ and $R(w, \ell)$. Since $t_w < t_v < t_{endT}$,
by Lemma 8.8, at time $t_v$, $\ell$ is in the readset of some active transaction $T' \in \mathrm{desc}(T)$. Since
$v \notin \mathrm{memOps}(T)$, we know $T \notin \mathrm{ances}(v)$. Thus, since $T'$ is a descendant of $T$, we have $T' \notin \mathrm{ances}(v)$.
Since $T' \notin \mathrm{ances}(v)$, by Definition 21, at time $t_v$, $v$ generates a conflict with $T'$. The other two
cases, where $R(v, \ell) \wedge W(w, \ell)$ or $W(v, \ell) \wedge W(w, \ell)$, are analogous. $\square$
Prefix race-freedom implies serializability by modules
Finally, we show that a trace $(C, \Phi)$ which is prefix race-free is also serializable by modules.
Theorem 8.10 Any trace $(C, \Phi)$ which is prefix race-free is also serializable by modules.
PROOF.
First, by Definition 28 and Lemma 8.6, it is easy to see that a prefix race-free sort $S$ of a trace
$(C, \Phi)$ is also a prefix race-free sort of $(\mathrm{mTree}(C, A), \Phi)$ for any Xmodule $A$. Now we shall argue
that for any Xmodule $A$, we can transform $S$ into $S_A$ such that all transactions in $\mathrm{xactions}(A)$
appear contiguous in $S_A$.
Consider a prefix race-free sort $S$ of $(\mathrm{mTree}(C, A), \Phi)$ which has $k$ nodes $v$ which violate the
second condition of Definition 29. We show how to construct a new order $S'$ which is still a prefix
race-free sort of $(\mathrm{mTree}(C, A), \Phi)$, but which has only $k - 1$ violations.
We reduce the number of violations according to the following procedure:
1. Of all transactions $T \in \mathrm{modXactions}(A)$ such that there exists an operation $v$ such that
$\mathrm{xbegin}(T) <_S v <_S \mathrm{xend}(T)$ and $v \notin V(T)$, choose the $T = T^*$ which has the latest
$\mathrm{xend}(T)$ in the order $S$.
2. In $T^*$, pick the first $v \notin V(T^*)$ which causes a violation.
3. Create a new sort $S'$ by moving $v$ to be immediately before $\mathrm{xbegin}(T^*)$.
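One step of this procedure can be sketched on a list-based order. The encoding is hypothetical: nodes are tuples such as `("xbegin", "T")`, `("xend", "T")`, or `("op", "a")`, and `members[T]` holds the nodes of $V(T)$ other than the begin and end markers. The sketch only performs the reordering; it does not check that the result is still a topological sort or still prefix race-free, which is what the surrounding proof establishes.

```python
def eliminate_one_violation(order, module_txns, members):
    """Move one violating node immediately before xbegin(T*), returning a new order."""
    pos = {node: i for i, node in enumerate(order)}
    candidates = []
    for T in module_txns:
        b, e = pos[("xbegin", T)], pos[("xend", T)]
        outsiders = [i for i in range(b + 1, e) if order[i] not in members[T]]
        if outsiders:
            # Record (position of xend, position of xbegin, first violating index).
            candidates.append((e, b, outsiders[0]))
    if not candidates:
        return list(order)  # no violations remain
    _, b, i = max(candidates)  # step 1: transaction with the latest xend
    v = order[i]               # step 2: its first violating node
    # Step 3: place v immediately before xbegin(T*).
    return order[:b] + [v] + order[b:i] + order[i + 1:]
```

Applying the function repeatedly until the order stops changing removes violations one at a time, mirroring the argument in the text.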
In order to argue that $S'$ is still a prefix race-free sort of $(\mathrm{mTree}(C, A), \Phi)$, we need to show that
moving $v$ does not generate any new prefix races, and does not create a sort $S'$ which is no longer
sequentially consistent with respect to $\Phi$ (i.e., that $\Phi$ is still the transactional last writer according
to $S'$). There are three cases: $v$ can be a memory operation, an $\mathrm{xbegin}(T')$, or an $\mathrm{xend}(T')$.
1. Suppose $v$ is a memory operation which accesses location $\ell$. For all operations $w$ such that
$\mathrm{xbegin}(T) <_S w <_S v$, we argue that $w$ can not access the same location $\ell$ unless both $w$ and
$v$ read from $\ell$. Since we chose $v$ to be the first memory operation such that $\mathrm{xbegin}(T) <_S
v <_S \mathrm{xend}(T)$ and $v \notin V(T)$, we know $w \in V(T)$. We know by construction of
$\mathrm{mTree}(C, A)$ that $w \in \mathrm{cContent}(T)$ (if $w \in \mathrm{oContent}(T)$ or $w \in \mathrm{aContent}(T)$, then
steps 1 or 2, respectively, in Definition 28 will turn $w$ into a nop). Therefore, by Definition 30,
unless $w$ and $v$ both read from $\ell$, $v$ has a prefix race with $T$, contradicting the fact that $S$ is
a prefix race-free sort of the trace. Thus, moving $v$ to be before $\mathrm{xbegin}(T)$ can not generate
any new prefix races or change the transactional last writer for any memory operation, and
$S'$ is still a prefix race-free sort of the trace.
2. Next, suppose $v = \mathrm{xbegin}(T')$. Moving $\mathrm{xbegin}(T')$ can not generate any new prefix
races with $T'$, because the only memory operations $u$ which satisfy $\mathrm{xbegin}(T) <_S u <_S
\mathrm{xbegin}(T')$ satisfy $u \notin \mathrm{cContent}(T')$. Also, moving $\mathrm{xbegin}(T')$ does not change the
transactional last writer for any node $v$ because the move preserves the relative order of all
memory operations. Therefore, $S'$ is still a prefix race-free sort.
3. Finally, suppose $v = \mathrm{xend}(T')$. By moving $\mathrm{xend}(T')$ to be before $\mathrm{xbegin}(T)$, we can
only lose prefix races with $T'$ that already existed in $S$ because we are moving nodes out of
the interval $[\mathrm{xbegin}(T'), \mathrm{xend}(T')]$. Also, as with $\mathrm{xbegin}(T')$, moving $\mathrm{xend}(T')$ does not
change any transactional last writers. Therefore, $S'$ is still a prefix race-free sort of the trace.
Since we can eliminate violations of the second condition of Definition 29 one at a time, we
can construct a sort $S_A$ which satisfies serializability by modules by eliminating all violations. $\square$
Finally, we can prove the OAT model guarantees serializability by modules by putting the pre-
vious results together.
Theorem 8.11 Any trace $(C, \Phi)$ generated by the OAT model is serializable by modules.
PROOF. By Theorem 8.9, the OAT model generates only traces $(C, \Phi)$ which are prefix race-free.
By Theorem 8.10, any trace $(C, \Phi)$ which is prefix race-free is serializable by modules. $\square$
Abstract serializability
By Theorem 8.11, the OAT model guarantees serializability by modules. We now relate this def-
inition to the notion of abstract serializability used in multilevel database systems [Wei86]. As
we mentioned in Section 8.1, the ownership-aware commit mechanism is a part of a methodology
which includes abstract locks and compensating actions. In this section we argue that the OAT model
provides enough flexibility to accommodate abstract locks and compensating actions. In addition,
if a program is "properly locked and compensated," then serializability by modules guarantees
abstract serializability.
The definition of abstract serializability in [Wei86] assumes that the program is divided into
levels, and that a transaction at level $i$ can only call a transaction at level $i + 1$ (see footnote 9). In addition,
transactions at a particular level have predefined commutativity rules, i.e., some transactions of the
same Xmodule can commute with each other and some can not. The transactions at the lowest level
(say $k$) are naturally serializable; call this schedule $Z_k$. Given a serializable schedule $Z_{i+1}$ of
level-$(i + 1)$ transactions, the schedule is said to be serializable at level $i$ if all transactions in $Z_{i+1}$ can be
reordered, obeying all commutativity rules, to obtain a serializable order $Z_i$ for level-$i$ transactions.
The original schedule is said to be abstractly serializable if it is serializable for all levels.
Footnote 9: We assume level number increases as you go from a higher level to a lower level, to be
consistent with our numbering of xid. In the literature (e.g., [Wei86]), levels typically go in the opposite direction.
These commutativity rules might be specified using abstract locks [NMAT+07]: if two
transactions can not commute, then they grab the same abstract lock in a conflicting manner. In the
application described in Section 8.2, for instance, transactions calling insert and remove on
the BST using the same key do not commute and should grab the same write lock. Although
abstract locks are not explicitly modeled in the OAT model, we can model transactions acquiring the
same abstract lock as transactions writing to a common memory location $\ell$ (see footnote 10). Locks associated
with an Xmodule $A$ are owned by $\mathrm{modParent}(A)$. A module $A$ is said to be properly locked if the
following is true for all transactions $T_1$, $T_2$ with $M_{T_1} = M_{T_2} = A$: if $T_1$ and $T_2$ do not commute,
then they access some $\ell \in \mathrm{modMemory}(\mathrm{modParent}(A))$ in a conflicting manner.
If all transactions are properly locked, then serializability by modules implies abstract serial-
izability (as defined above) in the special case when the module tree is a chain (i.e., each non-
leaf module has exactly one child). Let Si be the sort S in Definition 29 for Xmodule A with
xid(A) = i. This Si corresponds to Zi in the definition of abstract serializability.
In the general case for ownership-aware TM, however, by Rule 2 of Definition 19, we know a
transaction at level $i$ might call transactions from multiple levels $x > i$, not just $x = i + 1$. Thus, we
must change the definition of abstract serializability slightly; instead of reordering just $Z_{i+1}$ while
serializing transactions at level $i$, we have to potentially reorder $Z_x$ for all $x$ where transactions
at level $i$ can call transactions at level $x$. Even in this case, if every module is properly locked
(by the same definition as above), one can show serializability by modules guarantees abstract
serializability.
The methodology of open nesting often requires the notion of compensating actions or inverse
actions. For instance, in a BST, the inverse of insert is remove with the same key. When a
transaction T aborts, all the changes made by its subtransactions must be inverted. Again, although
the OAT model does not explicitly model compensating actions, it allows an aborting transaction
with status PENDINGABORT to perform an arbitrary but finite number of operations before
changing the status to ABORTED. Therefore, an aborting transaction can compensate for all its aborted
subtransactions.
8.7 Deadlock Freedom
In this section, we argue that the OAT model described in Section 8.5 can never enter a "semantic
deadlock" if we impose suitable restrictions on the memory accessed by a transaction's abort
actions. In particular, an abort action generated by transaction $T$ from $M_T$ should read (write) from
a memory location $\ell$ belonging to $\mathrm{modAnces}(M_T)$ only if $\ell$ is already in $R(T)$ ($W(T)$). Under these
conditions, we show that the OAT model can always "finish" reasonable computations.
Footnote 10: More complicated locks can be modeled by generalizing the definition of conflict.
An ordinary TM without open nesting and with eager conflict detection never enters a semantic
deadlock because it is always possible to finish aborting a transaction $T$ without generating
additional conflicts; a scheduler in the TM runtime can abort all transactions, and then complete
the computation by running the remaining transactions serially. Using the OAT model, however,
a TM system can enter a semantic deadlock because it can enter a state in which it is impossible
to finish aborting two parallel transactions $T_1$ and $T_2$ which both have status PENDINGABORT. If
$T_1$'s abort action generates a memory operation $u$ which conflicts with $T_2$, then $u$ will wait for $T_2$
to finish aborting (i.e., when the status of $T_2$ becomes ABORTED). Similarly, $T_2$'s abort action can
generate an operation $v$ which conflicts with $T_1$ and waits for $T_1$ to finish aborting. Thus $T_1$ and $T_2$
can both wait on each other, and neither transaction will ever finish aborting.
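The mutual wait described above is a cycle in a wait-for relation between aborting transactions. A minimal sketch of detecting such a cycle follows; the `waits_for` dictionary (each waiting transaction mapped to the single transaction it waits on) is a hypothetical encoding and not part of the OAT model.

```python
def find_wait_cycle(waits_for):
    """Return a list of transactions forming a wait-for cycle, or None.

    waits_for maps a waiting transaction to the one transaction it is
    waiting on to finish aborting.
    """
    for start in waits_for:
        seen = []
        cur = start
        while cur in waits_for and cur not in seen:
            seen.append(cur)
            cur = waits_for[cur]
        if cur in seen:
            return seen[seen.index(cur):]  # the portion of the walk that loops
    return None
```

With `{"T1": "T2", "T2": "T1"}` the function reports the cycle `["T1", "T2"]`, which corresponds to the two PENDINGABORT transactions that can never finish aborting.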
Defining semantic deadlock
Intuitively, we want to say that the OAT model exhibits a semantic deadlock if it causes the TM
system to enter a state in which it is impossible to "finish" a computation because of transaction
conflicts. A computation might not finish for other reasons, such as an infinite loop or livelock.
This section defines semantic deadlock precisely and distinguishes it from these other reasons for
noncompletion.
Recall that our abstract model has two entities: the program, and a generic operational model
F representing the runtime system. At any time $t$, given a ready node $X \in \mathrm{ready}(C)$, the program
chooses an instruction and has $X$ issue the instruction. If the program issues an infinite number of
instructions, then F can not complete the program no matter what it does. To eliminate programs
which have infinite loops, we only consider bounded programs.
Definition 32 We say that a program is bounded for an operational model F if any computation
tree that F generates for that program is of a finite depth, and there exists a finite number $K$ such
that at any time $t$, every node $B \in \mathrm{nodes}(t, C)$ has at most $K$ children with status PENDING or
COMMITTED.
Even if the program is bounded, it might run forever if it livelocks. We use the notion of a
schedule to distinguish livelocks from semantic deadlocks.
Definition 33 A schedule F on some time interval $[t_0, t_1]$ is the sequence of nondeterministic
choices made by an operational model in the interval.
An operational model F makes two types of nondeterministic choices. First, at any time t, F non-
deterministically chooses which ready node X E ready(C) executes an instruction. This choice
models nondeterminism in the program due to interleaving of the parallel executions. Second,
while performing a memory operation u which generates a conflict with transaction T, F nonde-
terministically chooses to abort either xparent [u] or T. This nondeterministic choice models the
contention manager of the TM runtime. A program may livelock if F repeatedly makes "bad"
scheduling choices.
Intuitively, an operational model deadlocks if it allows a bounded computation to reach a state
where no schedule can complete the computation after this point.
Definition 34 Consider an F executing a bounded computation. We say that F does not exhibit a
semantic deadlock iffor allfinite sequences ofto instructions that F can issue that generates some
intermediate computation tree Co, there exists afinite schedule F on [to, tl] such that F brings the
computation tree to a rest state C1, i.e., ready (CI) = {root(C1)}.
This definition is sufficient, since once the computation tree is at the rest state, and only the root
node is ready, F can execute each transaction serially and complete the computation.
Restrictions to avoid semantic deadlock
The general OAT model described in Section 8.5 exhibits semantic deadlock because it may enter
a state where two parallel aborting transactions T1 and T2 keep each other from completing their
aborts. For a restricted set of programs, where a PENDING_ABORT transaction T never accesses
new memory belonging to Xmodules at MT's level or higher, however, we can show the OAT
model is free of semantic deadlock.
More formally, for all transactions T, we restrict the memory footprint of abortactions(T).
Definition 35 An execution (represented by a computation tree C) has abort actions with limited
footprint if the following condition is true for all transactions T ∈ aborted(C). At time t, if a
memory operation v ∈ abortactions(T) accesses location ℓ and owner(ℓ) ∈ modAnces(MT),
then (1) if v is a read, then ℓ ∈ R(T), and (2) if v is a write, then ℓ ∈ W(T).
Intuitively, Definition 35 requires that once a transaction T's status becomes PENDING_ABORT,
any memory operation v which T or a nested transaction inside T performs to finish aborting T
cannot read from (write to) any location ℓ which is owned by any Xmodule that is an ancestor
of MT (including MT itself), unless ℓ is already in the read (write) set of T.
First, we show that the properties of Xmodules from Theorem 8.5 in combination with the
ownership-aware commit mechanism imply that transaction read sets and write sets exhibit nice
properties. In particular, we have Corollary 8.12, which states that a location ℓ can appear in the
read set of a transaction T only if T's Xmodule is a descendant of owner(ℓ) in the module tree D.
Corollary 8.12 For any transaction T, if ℓ ∈ R(T), then MT ∈ modDesc(owner(ℓ)).
PROOF. Follows from Definition 19 and Theorem 8.5, and induction on how a location ℓ can
propagate into read sets and write sets using the ownership-aware commit mechanism. □
If all abort actions have a limited footprint, we can show that operations of an abort action of
an Xmodule A can only generate conflicts with a "lower-level" Xmodule.
Lemma 8.13 Suppose the OAT model generates an execution where abort actions have limited
footprint. For any transaction T, consider a potential memory operation v ∈ abortactions(T).
If v conflicts with transaction T', then xid(MT') > xid(MT).
PROOF. Suppose v ∈ abortactions(T) accesses a memory location ℓ with owner(ℓ) = A.
Since abortactions(T) ⊆ memOps(T), by the properties of Xmodules given in Definition 20, we
know that either A ∈ modAnces(MT), or xid(A) > xid(MT). If A ∈ modAnces(MT), then
by Definition 35, T already had ℓ in its read or write set. Therefore, using Definition 21, v
cannot generate a conflict with T' because then T would already have had a conflict with T' before v
occurred, contradicting the eager conflict detection of the OAT model.
Thus, we have xid(A) > xid(MT). If v conflicts with some other transaction T', then T' has
ℓ in its read or write set. Therefore, from Corollary 8.12, MT' ∈ modDesc(A). Thus, we have
xid(MT') > xid(A) > xid(MT). □
Theorem 8.14 In the case where abort actions have limited footprint, the OAT model is free
from semantic deadlock.
PROOF. Let C0 be the computation tree after any finite sequence of t0 instructions. We describe
a schedule Γ which finishes aborting all transactions in the computation by executing abort actions
and transactions serially.
Without loss of generality, assume that at time t0, status[T] = PENDING_ABORT for all
active transactions T. Otherwise, the first phase of the schedule Γ is to make this status change for
all active transactions T.
For a module tree D with k = |D| Xmodules (including the world), we construct a schedule Γ
with k phases, numbered k − 1, k − 2, ..., 1, 0. The invariant we maintain is that immediately before
phase i, we bring the computation tree into a state C(i) which has no active transaction instances
T with xid(MT) > i, i.e., no instances T from Xmodules with xid larger than i. During phase
i, we finish aborting all active transaction instances T with xid(MT) = i. By Lemma 8.13, any
abort action for a T, where xid(MT) = i, can only conflict with a transaction instance T' from a
lower-level Xmodule, where xid(MT') > i. Since the schedule Γ executes serially, and since by
the inductive hypothesis we have already finished all active transaction instances from lower levels,
phase i can finish without generating any conflicts. □
Restrictions on compensating actions
If transactions Y1, Y2, ..., Yj are nested inside transaction X and X aborts, typically the abort actions
of X simply consist of compensating actions for Y1, Y2, ..., Yj. Thus, restrictions on abort actions
translate in a straightforward manner to restrictions on compensating actions: a compensating
action for a transaction Yi (which is part of the abort action of X) should not read (write) any memory
owned by MX or its ancestor Xmodules unless the memory location is already in X's read (write)
set. Assuming locks are modeled as accesses to memory locations, the same restriction applies;
that is, a compensating action cannot acquire new locks that were not already acquired by the
transaction it is compensating for.
8.8 Related Work
In this chapter, we described ownership-aware transactions, which provide a disciplined method-
ology for open nesting while guaranteeing abstract serializability. In this section, we describe two
other approaches for improving open-nested transactions, and distinguish them from ownership-
aware transactions.
In [NMAT+07], Ni et al. propose using an open_atomic class to specify open-nested
transactions in a Java-like language with transactions. Since the private fields of an object with an
open_atomic class type cannot be directly accessed outside of that class, one can think of the
open_atomic class as defining an Xmodule. This mapping is not exact, however, because
neither the language nor the TM system restricts exactly what memory can be passed into a method of
an open_atomic class, and the TM system performs a vanilla open-nested commit for a nested
transaction, not a safe-nested commit. Thus, it is unclear what exact guarantees are provided with
respect to serializability and/or deadlock freedom.
Herlihy and Koskinen in [HK08] describe a technique of transactional boosting which allows
transactions to call methods from a nontransactional module A. Roughly, as long as A is lineariz-
able and its methods have well-defined inverses, the authors show that the execution appears to
be "abstractly serializable." Boosting does not, however, address the cases when the lower-level
module A writes to memory owned by the enclosing higher-level module, or when programs have
more than two levels of modules.
Chapter 9
Nested Parallelism within Transactions
Most TM implementations do not allow parallelism inside transactions. Therefore, a function con-
taining parallelism (and transactions to synchronize between its parallel threads) cannot be called
from within another transaction. This limitation manifests itself in two ways. First, it breaks ab-
straction by disallowing certain method calls purely on the basis of their implementation. Second,
in programs using a dynamic-multithreaded language such as MIT Cilk, it is unnatural to write code
with no parallelism inside transactions; adding transactions to these languages in a natural manner
generates code with parallelism inside transactions. This chapter describes the first provably effi-
cient software transactional memory system that allows parallelism inside transactions, specifically
for languages that use Cilk-like work-stealing schedulers. Although this work is primarily theoret-
ical, it provides a basis for concurrency platforms that allow transactions in dynamic multithreaded
languages.
The remainder of this chapter is organized as follows. Section 9.1 provides the motivation of
this work and the results. We use a computation tree described in Chapter 6 to model a compu-
tation with nested parallel transactions; Section 9.2 describes the computation tree and how the
CWSTM runtime system maintains this computation tree online. Section 9.3 defines our CWSTM
semantics. We show in Section 9.4 that a naive conflict-detection algorithm has poor worst-case
performance. Section 9.5 describes the high-level design of CWSTM and its use of XConflict for
conflict-detection. Section 9.6 gives an overview of the XConflict algorithm. Sections 9.7 and 9.10
provide details on data structures used by XConflict. Finally, Section 9.11 claims that XConflict,
and hence CWSTM, is efficient for programs that experience no conflicts or contention.
9.1 Motivation and Results
Most work on transactional memory focuses exclusively on supporting transactions in programs
that use persistent threads (e.g., pthreads). TM systems for such an environment are designed as-
suming that transactions execute serially, since the overhead of creating or destroying a pthread
naturally discourages programmers from having nested parallelism inside a transaction. Further-
more, the special case of serial transactions greatly simplifies conflict detection for TM. Typically,
the TM runtime detects a conflict between two distinct active transactions if they both access the
same object £, at least one transaction tries to write to £, and both transactions are executed on
This is joint work with Jeremy Fineman and Jim Sukha [AFS08].
PARALLELINCREMENT()
1 x ← 0
2 parallel
3   { x ← x + 1 }
4   { x ← x + 10
5     parallel
6       { x ← x + 100 }
7       { x ← x + 1000 } }
8 print x
Figure 9.1: A simple fork-join program that does several parallel increments of a shared variable. The
parallel statement, similar to Dijkstra's "cobegin," allows the two following code blocks (each
contained in {·}) to run in parallel. The subsequent line (line 8) executes after both parallel blocks
complete. This program contains several races; assuming sequential consistency, valid outputs are
x ∈ {1, 11, 101, 110, 111, 1001, 1010, 1011, 1101, 1110, 1111}.
Figure 9.2: The series-parallel dag for the sample program given in Figure 9.1. Edges correspond to in-
structions in the program. Diamonds and squares correspond to the start and end, respectively, of parallel
constructs.
different threads. This last condition is relevant for TM that supports nested transactions. Concep-
tually, when transactions execute serially, two active transactions executing on the same thread are
allowed to access the same object because one must be nested inside the other.
In dynamic multithreaded languages such as Cilk [BJK+96, Sup06] or NESL [BG96] or
multithreaded libraries like Hood [BP99], a programmer specifies dynamic parallelism using linguistic
constructs such as "fork" and "join," "spawn" and "sync," or "parallel" blocks. Dynamic
multithreaded languages allow the programmer to specify a program's parallel structure abstractly, that
is, permitting (but not requiring) parallelism in specific locations of the program. A runtime system
(e.g., for Cilk) dynamically schedules the program on the number of processors, P, specified at
execution time. If the language also permits only "properly nested" parallelism, i.e., any program
execution can be represented as a "series-parallel dag" or "series-parallel parse tree" [FL97], then
a Cilk-like work-stealing scheduler executes the program in a provably efficient manner [BJK+96].
A natural question arises: how can transactions be added to a dynamic multithreaded language such
as Cilk?
To pose the problem more concretely, consider the series-parallel program shown in Figure 9.1,
which performs parallel increments to a shared variable. Figure 9.2 gives the corresponding series-
parallel dag for the program. One natural way to add transactions to a series-parallel program is by
wrapping segments of the program in atomic blocks, as illustrated by Figure 9.3. As shown, it is
easy to generate transactions (e.g., X3) with nested parallelism and nested transactions (e.g., X4).
How does a TM system execute the program in Figure 9.3?
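The set of valid outputs listed in the caption of Figure 9.1 can be checked by brute force. The following Python sketch is only an illustration and is not part of any TM system. It assumes that each increment decomposes into one atomic read followed by one atomic write, and it enumerates every interleaving that the fork-join structure permits under sequential consistency. The thread labels A through D and all function names are hypothetical.

```python
from itertools import permutations

# One label per increment in Figure 9.1 (labels are illustrative):
# A: x <- x+1, B: x <- x+10, C: x <- x+100, D: x <- x+1000.
INCREMENT = {"A": 1, "B": 10, "C": 100, "D": 1000}
OPS = [kind + name for name in INCREMENT for kind in ("r", "w")]

# (earlier, later) pairs imposed by the program structure: each read
# precedes its own write, and the inner parallel block starts only
# after B's write.
ORDER = [("rA", "wA"), ("rB", "wB"), ("rC", "wC"), ("rD", "wD"),
         ("wB", "rC"), ("wB", "rD")]


def respects_order(seq):
    """True if the interleaving obeys every precedence constraint."""
    pos = {op: i for i, op in enumerate(seq)}
    return all(pos[a] < pos[b] for a, b in ORDER)


def run(seq):
    """Execute one interleaving on a shared x that starts at 0."""
    x = 0
    register = {}  # value each increment read before writing
    for op in seq:
        kind, name = op[0], op[1]
        if kind == "r":
            register[name] = x
        else:
            x = register[name] + INCREMENT[name]
    return x


def valid_outputs():
    """All values that line 8 could print, in increasing order."""
    return sorted({run(s) for s in permutations(OPS) if respects_order(s)})
```

Calling valid_outputs() yields the eleven values given in the caption of Figure 9.1. The enumeration covers only the transaction-free program; modeling Figure 9.3 would additionally require conflict detection and aborts.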
This chapter describes a way of adding transactions to a dynamic multithreaded language that
generates only series-parallel programs. We focus on a provably efficient TM system that supports
XPARALLELINCREMENT()
1 atomic {                                  ▷ Transaction X1
2   x ← 0
3   parallel
4     { atomic { x ← x + 1 } }              ▷ X2
5     { atomic {                            ▷ X3
6         x ← x + 10
7         parallel
8           { x ← x + 100 }
9           { atomic { x ← x + 1000 } }     ▷ X4
10      } }
11 }
12 print x
Figure 9.3: The program from Figure 9.1 with the addition of some transactions, denoted by atomic{·}
blocks. The triangle denotes a comment. Since atomic blocks are not placed around all increments, this
program still permits multiple outputs; valid outputs are 111 and 1111. The (symmetric) 1011 is excluded
due to strong atomicity.
unbounded nesting and parallelism. That is, we want a TM system with a bound on a program's
completion time that is independent of the maximum nesting depth of transactions. It turns out that
TMs that perform work on every transaction commit proportional to the size of the transaction's
"readset" or "writeset" cannot support an unbounded nesting depth efficiently. Generally, TM with
lazy conflict detection requires work proportional to the size of the transaction's readset, and TM
with lazy updates requires work proportional to the size of the transaction's writeset. Thus, we focus
on TM with eager conflict detection and eager updates, since both require only a constant amount
of work on every commit.
A key component of a TM system with nested parallelism is the conflict-detection scheme.
We describe the semantics for TM with eager conflict detection for series-parallel computations
with transactions. We present XConflict, a data structure that a software TM system can use to
query for conflicts when implementing these semantics. For Cilk-like work-stealing schedulers,
XConflict answers concurrent queries in O(1) time and can be maintained efficiently. In particular,
consider a program with T1 work and a span (or critical-path length) of T∞. The running time on P
processors of the program augmented with XConflict is only O(T1/P + PT∞). In comparison, with
high probability, Cilk executes the program without XConflict in the asymptotically optimal time
O(T1/P + T∞). These two bounds imply that maintaining the XConflict data structure does not
asymptotically increase the running time of the program, compared to Cilk, when √(T1/T∞) ≫ P.
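The condition on P in the last sentence follows from comparing the two additive terms of the XConflict bound. A short worked version, using only the work, span, and processor-count symbols introduced above, is:

```latex
\[
P\,T_\infty \le \frac{T_1}{P}
\;\iff\; P^2 \le \frac{T_1}{T_\infty}
\;\iff\; P \le \sqrt{T_1/T_\infty},
\]
\[
\text{so for } P \le \sqrt{T_1/T_\infty}:\qquad
O\!\left(\frac{T_1}{P} + P\,T_\infty\right) = O\!\left(\frac{T_1}{P}\right).
\]
```

In that regime the work term dominates both bounds, so the program with XConflict and the program without it run in O(T1/P) time.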
We describe CWSTM, a design for a software TM system with eager updates that uses the
XConflict data structure. CWSTM provides strong atomicity and supports lazy cleanup on aborts
(i.e., when a transaction X aborts, other transactions can help roll back the objects modified by X).
The XConflict bounds translate to CWSTM in a restricted case when there are no concurrent
readers (all memory accesses are treated as writes) and there are no transaction aborts. If the underlying
transaction-free program has T1 work and T∞ span, then CWSTM executes the transactional
program in time O(T1/P + PT∞) when run on P processors. At first glance, these bounds might
seem uninteresting due to the restrictions. It is difficult, however, for any TM system to provide
any nontrivial bounds on completion time in the presence of aborts, since the system might redo an
arbitrary amount of work. Moreover, TM with eager conflict detection that allows more than a con-
stant number of shared readers to an object can potentially lead to memory contention; thus, even
if there are no conflicts on that object, it seems difficult to provide efficient worst-case theoretical
bounds.
9.2 CWSTM Framework
We use the computation tree framework described in Chapter 6 to model CWSTM program
executions. For simplicity in explanation, however, we use a canonical computation tree. In addition,
instead of using the tree for a posteriori analysis, as in the previous chapter, the CWSTM runtime
system explicitly maintains the computation tree. The computation tree is not given a priori (i.e., from
a static analysis of the program); rather, it unfolds dynamically as the program executes. Moreover,
nondeterminism in the program may result in different computation trees. Constructing the
computation tree on the fly as the program executes is not difficult and thus is not described in full in this
chapter. A partial program execution corresponds to a partial traversal of the computation tree.
In our canonical computation tree, all P-nodes have exactly 2 nontransactional S-nodes as chil-
dren, while S-nodes can have an arbitrary number of children. In addition, we require that no
nontransactional S-node has a child nontransactional S-node. Thus, it follows that all nontransac-
tional S-nodes are children of P-nodes (or the root of the tree). For convenience, we treat the root
of the computation tree as both a transactional and a nontransactional S-node.
For any node B ≠ root(C), we define the transactional parent xparent[B] as Z = parent[B]
if Z is a transaction or root(C), and as xparent[Z] otherwise. Similarly, we define the nontransactional
S-node parent nsParent[B] as Z = parent[B] if Z is a nontransactional S-node or root(C),
and as nsParent[Z] otherwise.
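The two recursive definitions above can be sketched directly. The following Python fragment is a minimal illustration, not the CWSTM implementation; the Node class, its fields, and the example tree are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Node:
    """An internal computation-tree node (hypothetical representation)."""
    kind: str                        # "S" for an S-node, "P" for a P-node
    transactional: bool = False      # meaningful only for S-nodes
    parent: Optional["Node"] = None  # None marks the root

    def is_root(self):
        return self.parent is None


def xparent(b):
    """Nearest transactional ancestor of b, or the root."""
    z = b.parent
    if z is None:
        raise ValueError("xparent is undefined for the root")
    if z.is_root() or (z.kind == "S" and z.transactional):
        return z
    return xparent(z)


def ns_parent(b):
    """Nearest nontransactional S-node ancestor of b, or the root."""
    z = b.parent
    if z is None:
        raise ValueError("nsParent is undefined for the root")
    if z.is_root() or (z.kind == "S" and not z.transactional):
        return z
    return ns_parent(z)


# Example: root -> X1 (transaction) -> P-node -> nontransactional S-node -> X2.
root = Node("S")
x1 = Node("S", True, root)
p = Node("P", False, x1)
s_left = Node("S", False, p)
x2 = Node("S", True, s_left)
```

In this example, xparent(x2) skips the nontransactional S-node and the P-node and returns x1, while ns_parent(x2) returns s_left.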
At any point during the computation-tree traversal, each node B in the computation tree has a
status, denoted by status[B]. The status can be one of PENDING, PENDING_ABORT, COMMITTED,
or ABORTED. A leaf is complete if the corresponding operation has been executed. An internal node
can complete only if all nodes in its subtree are complete. Thus, a complete node corresponds to the
root of a subtree that will not unfold further, and hence a node is complete if and only if its status
is COMMITTED or ABORTED. A node is active (having status PENDING or PENDING_ABORT) if
it has any unexecuted descendants. Once a node is complete, it can never become active again.
Any execution of the computation tree has the invariant that at any time, the set of active nodes
in the computation tree also forms a tree, with the leaves of this active tree being the set of ready
nodes. Only a node that is ready can be traversed to "discover" a new child node. (Discovering a
new child node corresponds to executing an instruction in the program: a read or write creates a new
leaf below the current S-node, a transactional begin creates a new transactional S-node, and a fork
statement creates a new P-node along with its two S-node children.) When a ready transactional
S-node completes, its parent becomes ready. When a ready nontransactional S-node Z completes,
if Z's sibling nontransactional S-node is already complete, then Z's parent (which is a P-node)
completes, and Z's grandparent becomes ready.
Figure 9.5 shows the structure of the computation tree after an execution of the code from
Figure 9.4: A legend for computation-tree figures.
Figure 9.5: A computation tree for an execution of the program given by Figure 9.3, in which transaction X2
aborted once and was retried as the transaction X2'. The root X0 does not correspond to any transaction in
the program; it is just the S-node root of the tree. Each increment to x on line j of the program decomposes
into two atomic memory operations: a read uj and a write vj. The corresponding code is shown in a
gray oval under the accesses.
Figure 9.3 in which transaction X2 aborts once. If other aborts (and retries) occurred, the computation
tree would have additional subtrees.
9.3 CWSTM Semantics
This section describes CWSTM semantics, a semantics for a generic transactional memory
system with nested parallel transactions and eager conflict detection. We describe these semantics
operationally, in terms of a "readset" and "writeset" for each transaction. In particular, we define
conflicts and describe transaction commits and aborts abstractly using readsets and writesets. Later,
in Section 9.4, we give a simple design for a TM that provides these semantics. In Section 9.5, we
improve the simple TM and present the CWSTM design.
At any point during the program execution, the readset of a transaction X is the set of objects
ℓ that X has "accessed". Similarly, the writeset is the set of objects ℓ that X has written to.
Operationally, readsets and writesets change as follows. A transaction begins with an empty readset and
empty writeset. Whenever a successful read of ℓ1 occurs in a memory-operation (leaf) node u,
ℓ1 is added to xparent[u]'s readset. Similarly, whenever a successful write of ℓ2 occurs in a
memory-operation node u, ℓ2 is added to xparent[u]'s readset and writeset. A read or a write
to ℓ by an operation u "observes" the value associated with the write stored in the writeset of Z,
where Z is the nearest transactional ancestor of u that contains ℓ in its writeset. For consistency,
the writeset of the computation tree's root contains all objects. When a transaction X commits, its
readset and writeset are merged into xparent[X]'s readset and writeset, respectively.
A transactional memory with eager conflict detection must test for conflict before performing
each read or write. An access is unsuccessful if it generates a transactional conflict. TM
systems with serial, closed-nested transactions report conflicts when two active transactions on
different threads are accessing the same object ℓ, and one of those accesses is a write. Thus, only a
single active transaction is allowed to contain ℓ in its writeset at one time. For CWSTM semantics,
we generalize this definition of conflict in a straightforward manner. At any point in time, let
readers(ℓ) and writers(ℓ) be the sets of active transactions that have object ℓ in their readsets or
writesets, respectively. Then, we define conflicts as follows:
Definition 36 At any point in time, a memory operation v generates a conflict if
1. v reads object ℓ, and ∃X ∈ writers(ℓ) such that X ∉ ances(v), or
2. v writes to object ℓ, and ∃X ∈ readers(ℓ) such that X ∉ ances(v).
If there is such a transaction X, then we say that v conflicts with X. If v belongs to the transaction
X', then we say that X and X' conflict with each other.
If a memory operation v would cause a conflict between X = xparent[v] and another
transaction X', then v triggers an abort of either X or X' (or both). Say X is aborted. An abort of
a transaction X changes status[X] from PENDING to PENDING_ABORT, and also changes the
status of any PENDING (nested) transaction Y in the subtree of X to PENDING_ABORT. In
general, a PENDING_ABORT transaction X that is also ready can only complete by changing its status
to ABORTED. Conceptually, when a transaction X is ABORTED, CWSTM semantics discards X's
writeset and readset. Since X is no longer active after this action occurs, the action also
conceptually removes X from readers(ℓ) and writers(ℓ) for all objects ℓ. Note that in CWSTM, if v
causes a conflict, and the runtime chooses to abort X' = xparent[v], then the conflict is not fully
resolved until status[X'] has changed to ABORTED.
Consider a computation subtree rooted at a transaction X with status[X] = PENDING. Since
we allow only closed-nested transactions, if every child of X has completed, CWSTM can commit
X, i.e., change X's status from PENDING to COMMITTED, and merge X's readset and writeset
into those of xparent[X].
Code example
We can now describe how the CWSTM semantics constrain the possible outputs of the program in
Figure 9.3. Since parallelism is allowed in transactions, we must consider the scoping of atomicity.
In particular, the x ← x + 1 in line 4 and the code block in lines 6-9 must appear as though
one executes entirely before the other. If the atomic statements in lines 4 and 5 were removed,
then these two blocks could interleave arbitrarily, even though the entire procedure is protected by
an atomic statement in line 1. Basically, the atomicity applies only when comparing two blocks
of code belonging to different transactions (protected by different atomic statements), not parallel
blocks within the same transaction (protected by the same atomic statement).
Conflict as stated in Definition 36 naturally enforces strong atomicity [BLM06]. Strong atom-
icity implies that although line 8 is not atomic, it cannot perform its write between line 9's read
and write. In terms of the computation tree in Figure 9.5, after u9 performs a read of x, it adds
x to the readset of X4; thus, after u9 occurs but before X4 commits, if v8 tries to write to x, it will
cause a conflict with X4. We can, however, have line 8 read x, line 9 read and write x and commit,
and then line 8 write x. This interleaving can occur because when u8 happens, it adds x to the
readset of X3, and u9 and v9 can subsequently happen because they are both descendants of X3
in the computation tree. This behavior means that the increment of 1000 can be "lost" (by being
overwritten) but the increment of 100 cannot. Another way of describing strong atomicity is that
each memory operation is viewed as a transaction.
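The two interleavings just described can be replayed on a small model of the readset and writeset rules and of Definition 36. The Python sketch below is illustrative only: it ignores values, aborts, and scheduling, and the Txn class and the function names are hypothetical. The root is modeled with empty sets, which is harmless here because the root is an ancestor of every operation and so can never be reported as a conflict.

```python
class Txn:
    """A transaction with a readset and a writeset (hypothetical model)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.readset = set()
        self.writeset = set()
        self.active = True


def ances(txn):
    """txn together with all of its transactional ancestors."""
    chain = set()
    while txn is not None:
        chain.add(txn)
        txn = txn.parent
    return chain


def conflicting(kind, obj, txn, txns):
    """Definition 36 for an access whose transactional parent is txn."""
    holders = [x for x in txns if x.active and
               obj in (x.writeset if kind == "read" else x.readset)]
    mine = ances(txn)
    return [x for x in holders if x not in mine]


def access(kind, obj, txn, txns):
    """Attempt an access; return the conflicting transactions, if any."""
    bad = conflicting(kind, obj, txn, txns)
    if not bad:                    # successful access: update the sets
        txn.readset.add(obj)
        if kind == "write":
            txn.writeset.add(obj)
    return bad


def commit(txn):
    """Closed-nested commit: merge both sets into the parent's."""
    txn.parent.readset |= txn.readset
    txn.parent.writeset |= txn.writeset
    txn.active = False


def fresh():
    """State of Figure 9.3 after line 6: X1 and X3 have written x."""
    root = Txn("root")
    x1 = Txn("X1", root)
    x3 = Txn("X3", x1)
    x4 = Txn("X4", x3)
    txns = [root, x1, x3, x4]
    access("write", "x", x1, txns)   # line 2
    access("write", "x", x3, txns)   # line 6
    return x3, x4, txns


# Interleaving 1: u9 reads inside X4, then v8 (line 8, in X3) tries to write.
x3, x4, txns = fresh()
access("read", "x", x4, txns)
blocked = [t.name for t in access("write", "x", x3, txns)]

# Interleaving 2: u8, then u9 and v9, X4 commits, then v8.
x3, x4, txns = fresh()
access("read", "x", x3, txns)        # u8
access("read", "x", x4, txns)        # u9
access("write", "x", x4, txns)       # v9
commit(x4)
allowed = [t.name for t in access("write", "x", x3, txns)]
```

In the first replay, blocked names X4, matching the conflict described above. In the second, allowed is empty, so the write on line 8 proceeds and overwrites the increment of 1000.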
Semantic guarantees
The CWSTM semantics maintains the invariant that a program execution is always conflict-free,
according to Definition 36. One can show that when transactions have nested parallel transactions,
TM with eager conflict detection according to Definition 36 satisfies the transactional-memory
model of prefix race-freedom defined in [ALS06].1 As shown in [ALS06], prefix race-freedom and
serializability are equivalent if one can safely "ignore" the effects of aborted transactions. Note that
this equivalence may not hold in TM systems with explicit retry constructs that are visible to the
programmer.
Definition 36 directly implies the following lemma about a conflict-free execution.
Lemma 9.1 For a conflict-free execution, the following invariants hold for any object ℓ:
1. All transactions X ∈ writers(ℓ) fall along a single root-to-leaf path in C. Let lowest(writers(ℓ))
denote the unique transaction Y ∈ writers(ℓ) such that writers(ℓ) ⊆ ances(Y).
1The proof is a special case of the proof for the operational model described in Chapter 6, without any open-nested
transactions.
2. All transactions X ∈ readers(ℓ) are either along the root-to-leaf path induced by the
writers or are descendants of lowest(writers(ℓ)).
We use Lemma 9.1 to argue that one can check for conflicts for a memory operation u by
looking at one writer and only a small number of readers. Since all the writing transactions fall on
a single root-to-leaf path, by Lemma 9.1, Invariant 1, the transaction lowest(writers(ℓ))
belongs to writers(ℓ) and is a descendant of all other transactions in writers(ℓ). Similarly, let Q =
lastReaders(ℓ) ⊆ desc(lowest(writers(ℓ))) denote the set of readers implied by Invariant 2.
If a memory operation u tries to read ℓ, abstractly, there is no conflict if and only if
lowest(writers(ℓ)) is an ancestor of u. Similarly, when u tries to write to ℓ, by Invariant 2,
there is no conflict if for all Z ∈ lastReaders(ℓ), Z is an ancestor of u.
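The two tests in this paragraph can be written as a pair of ancestor queries. The following Python sketch is a hedged illustration: it answers ancestor queries by walking parent pointers, which is not how XConflict works, and the Txn class, function names, and example tree are hypothetical.

```python
class Txn:
    """A transaction node with a parent pointer (hypothetical model)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def is_ancestor(z, x):
    """True if z is x itself or one of x's ancestors."""
    while x is not None:
        if x is z:
            return True
        x = x.parent
    return False


def read_ok(u_txn, lowest_writer):
    """A read succeeds if the lowest writer is an ancestor of the reader."""
    return is_ancestor(lowest_writer, u_txn)


def write_ok(u_txn, last_readers):
    """A write succeeds if every last reader is an ancestor of the writer."""
    return all(is_ancestor(z, u_txn) for z in last_readers)


# Example shaped like Figure 9.3: X2 and X3 are parallel children of X1,
# and X4 is nested inside X3.
root = Txn("root")
x1 = Txn("X1", root)
x2 = Txn("X2", x1)
x3 = Txn("X3", x1)
x4 = Txn("X4", x3)

# Suppose X3 is the lowest writer of x and X4 is its only last reader.
lowest_writer, last_readers = x3, {x4}
```

With this state, a read issued under X4 passes the test and a read issued under X2 fails it. A write issued directly under X3 also fails, because X4 is a last reader that is not an ancestor of X3.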
9.4 A Naive TM
The CWSTM semantics described in Section 9.3 suggest a design for a TM system that supports
transactions with nested parallelism. In particular, Lemma 9.1 suggests that for each object ℓ, the
TM can maintain an active writing transaction lowest(writers(ℓ)) and some active reading
transactions lastReaders(ℓ). This scheme allows transactions accessing ℓ to test for conflicts against
these transactions. This section focuses on a straightforward data structure, called an "access stack,"
used to maintain these values. We show that an access stack yields a TM with poor worst-case per-
formance, even assuming the rest of the TM system incurs no overhead. The CWSTM design uses
a lazy variant of the access stack, described in Section 9.5, that has much better performance.
The access stack for an object ℓ is a stack containing the active transactions that have written
to ℓ and sets of active transactions that have read from ℓ. The order of transactions on the stack
is consistent with the ancestry of transactions in the computation tree. The writing transaction
lowest(writers(ℓ)) is either on top (first item to pop) of the stack, or is the next element on the
stack. If the writer is not on the top of the stack, then lastReaders(ℓ) is. No two consecutive
elements are sets of readers.
The access stack is maintained as follows, locking the relevant stack on every memory access to
guarantee atomicity. Consider (a memory operation whose transactional parent is) a transaction X
that successfully reads ℓ. If the top of the stack contains a set of readers, then X is added to that
set, assuming it is not already there. If the top of the stack is a writer other than X, then {X} is
added to the top of the stack. Similarly, if X successfully writes ℓ, then X is pushed onto the top
of the stack if it is not already there.
Whenever a transaction X commits, for each ℓ in X's readset, X is removed from the top of
ℓ's access stack and replaced with xparent[X] (in a fashion that ensures there are no duplicated
transactions). This action mimics the commit semantics from Section 9.3: when a transaction X
commits, the objects in its readset and writeset are moved to xparent[X]'s readset and writeset,
respectively. If instead X aborts, then X is popped from each relevant object's access stack. To
facilitate rollback on aborts, every access-stack entry corresponding to a write stores the old value
before the write.2
Maintaining the access stack has poor worst-case performance because the work required on the
commit of transaction X is proportional to the size of X's readset. If the original program (without
2This value can either be stored in the stack itself, or in a log per transaction.
transactions) had work T1, then this implementation might require work Ω(dT1), where d is the
maximum nesting depth of transactions. In particular, consider the following code snippet:
void f(int i) {
  if (i >= 1) { atomic { x[i]++; f(i-1); } }
}
A call of f(d) generates a serial chain of nested transactions, each incrementing a different place
in the array x. When the transaction at nesting depth j commits, it updates d - j access stacks,
for a total of Θ(d^2) access-stack updates. The work of the original program (without transactions),
however, is only Θ(d).
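The quadratic growth can be checked with a small count. The sketch below is not the thesis's implementation; it only simulates the bookkeeping, under the assumption that each commit performs one access-stack update per location in the committing transaction's footprint and that this footprint then moves to the parent. The function names are illustrative. Whether a transaction at depth j touches d - j or d - j + 1 stacks depends on the indexing convention; the total is quadratic either way.

```python
def eager_commit_updates(d):
    """Count access-stack updates caused by commits in the chain created by f(d)."""
    updates = 0
    inherited = 0                      # locations merged up from deeper transactions
    for _depth in range(d, 0, -1):     # the innermost transaction commits first
        footprint = inherited + 1      # its own x[i] plus everything it inherited
        updates += footprint           # one access-stack update per location
        inherited = footprint          # the footprint now belongs to the parent
    return updates


def original_work(d):
    """Work of the same program without transactions: one increment per call."""
    return d
```

Under these assumptions the commit bookkeeping sums to d(d+1)/2, while the untransacted program performs d increments.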
In general, this asymptotic blowup can occur if a TM system with nested transactions must
perform work proportional to the size of a transaction's readset or writeset on every commit. For
example, a TM system that validates every transaction due to lazy conflict detection for reads
exhibits this problem. Similarly, a TM system that copies data on commit due to lazy object updates
also has this issue.
9.5 CWSTM Overview
This section describes our CWSTM design for a transactional-memory system with nested parallel
transactions, eager updates, and eager conflict detection. We first describe how CWSTM updates
the computation-tree-node statuses on commits and aborts. We then give an overview of the
conflict-detection mechanism, deferring details of the XConflict data structure to later sections. The
conflict-detection mechanism includes a "lazy access stack," which addresses the shortcoming of the
access stack from Section 9.4. Finally, we describe properties of the Cilk-like work-stealing
scheduler that CWSTM uses. The XConflict data structure requires such a scheduler for its performance
and correctness.
CWSTM explicitly builds the internal nodes of the computation tree (i.e., leaf nodes for memory
operations are omitted). Each node maintains a status field that, in most cases, explicitly
represents the node's status (PENDING_ABORT, PENDING, COMMITTED, or ABORTED) and changes
in a straightforward fashion. For example, when a transaction X commits, CWSTM atomically
changes status[X] from PENDING to COMMITTED.
Since a transaction may signal an abort of a transaction running on a (possibly different)
processor whose descendants have not yet completed, aborting transactions is more involved. When
an active transaction X aborts itself (possibly because of a conflict), it simply atomically updates
status[X] ← ABORTED. We refer to this type of update as an xabort. Alternatively, suppose
a processor p_i wishes to abort X even though p_i is not currently executing X. First, p_i
atomically changes status[X] from PENDING to PENDING_ABORT. Then p_i walks X's active subtree,
changing status[Y] ← PENDING_ABORT atomically for each active Y ∈ desc(X). Notice that
p_i never changes any status to ABORTED; only the processor running a transaction Y is allowed to
perform that update. When X "discovers" that its status has changed to PENDING_ABORT, it has
no active descendants (otherwise, X cannot be ready, and hence X cannot be executing). Then, X
simply performs an xabort on itself.
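The two abort paths can be illustrated with a single-threaded sketch. It is not CWSTM's code: the class and function names (Txn, signal_abort, poll) are assumptions made for illustration, and plain assignments stand in for the atomic status updates that the text requires.

```python
from enum import Enum


class Status(Enum):
    PENDING = 1
    PENDING_ABORT = 2
    COMMITTED = 3
    ABORTED = 4


class Txn:
    def __init__(self, name, parent=None):
        self.name = name
        self.status = Status.PENDING
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_active(self):
        return self.status in (Status.PENDING, Status.PENDING_ABORT)


def xabort(x):
    """Self-abort: only the processor running x performs this update."""
    x.status = Status.ABORTED


def signal_abort(x):
    """Remote abort: mark x and its active subtree PENDING_ABORT, never ABORTED."""
    if x.status is not Status.PENDING:
        return False
    x.status = Status.PENDING_ABORT
    work = list(x.children)
    while work:
        y = work.pop()
        if y.is_active():              # completed descendants are left alone here
            y.status = Status.PENDING_ABORT
            work.extend(y.children)
    return True


def poll(x):
    """Run by x's own processor: x aborts itself once it notices the signal."""
    if x.status is Status.PENDING_ABORT:
        xabort(x)
```

In this sketch the signalling side only ever writes PENDING_ABORT, so the transition to ABORTED always happens on the processor that owns the transaction, as the text describes.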
For reasons specific to XConflict, the data structure that the CWSTM design uses for conflict
detection, during an abort of X some of X's COMMITTED descendants Y also have their status
field changed to ABORTED. Our conflict-detection algorithm uses these updates to determine more
quickly that a memory operation does not conflict with Y, since Y has an ABORTED ancestor
X. Section 9.9 describes when these updates occur.
XCONFLICT-ORACLE(X, u)
  ▷ For any node X and active memory operation u
1 if ∃Z ∈ ances(X) such that status[Z] = ABORTED
2    then return "no conflict: X aborted"
3 Y ← closest active transactional ancestor of X
4 if Y ∈ ances(u)
5    then return "no conflict: X committed to u's ancestor"
6    else pick a transaction B in (ances(Y) − ances(LCA(Y, u)))
7         return "conflict with B"
Figure 9.6: Pseudocode for a conflict-detection query suggested by Definition 37. Details of many
subroutines (e.g., line 3) are omitted (and in fact do not have efficient implementations). The LCA
function returns the least common ancestor of two nodes in the computation tree.
In CWSTM, the rollback of objects on abort occurs lazily and is thus decoupled from an
xabort operation. Once the status of a transaction X changes to ABORTED, other transactions
that try to access an object modified by X help with cleanup for that object.
Conflict detection and the lazy access stack
We now discuss conflict detection. The key observation that allows us to avoid explicit maintenance
of active readers and writers (or transaction readsets and writesets) is the following alternate conflict
definition.
Definition 37 Consider a (possibly inactive) transaction X that has written to ℓ and a new memory
operation v that reads from or writes to ℓ. Then v does not conflict with X if and only if
1. some transactional ancestor of X has aborted, or
2. X's nearest active transactional ancestor is an ancestor of v.
The case when X has read from ℓ and v writes to ℓ is analogous.
This definition is equivalent to Definition 36 because X's nearest active transactional ancestor
logically belongs to writers(ℓ) if X doesn't have an aborted ancestor.
Definition 37 suggests a conflict-detection algorithm that does not require maintaining
lowest(writers(ℓ)) and the normal access stack. In particular, let X be the last node that has
successfully written to ℓ. Then when u accesses ℓ, test for conflict by finding X's nearest active
transactional ancestor Y and determining whether Y is an ancestor of u. Figure 9.6 gives pseudocode
for this test. Note that CWSTM does not actually implement this query as given; instead, it uses an
equivalent, but more efficient query, described in Section 9.6.
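As a sanity check of the test suggested by Definition 37, the following sketch runs the oracle of Figure 9.6 on an explicit tree by walking parent pointers. It is deliberately naive (each query costs time proportional to the tree depth) and is not how CWSTM answers the query. The Node class, the string statuses, and the choice to treat ances as including the node itself are assumptions of the sketch, as is modeling the root as an always-active transaction.

```python
class Node:
    def __init__(self, name, parent=None, is_txn=False):
        self.name = name
        self.parent = parent
        self.is_txn = is_txn
        self.status = "PENDING"        # or "COMMITTED", "ABORTED"


def ancestors(n):
    """Yield n followed by its proper ancestors, nearest first."""
    while n is not None:
        yield n
        n = n.parent


def conflict_oracle(x, u):
    anc_x = list(ancestors(x))
    if any(z.status == "ABORTED" for z in anc_x):
        return "no conflict: X aborted"
    # Nearest active transactional ancestor of x (None only if no root transaction exists).
    y = next((z for z in anc_x if z.is_txn and z.status == "PENDING"), None)
    anc_u = set(ancestors(u))
    if y is None or y in anc_u:
        return "no conflict: X committed to u's ancestor"
    # Any transaction in ances(y) that is not an ancestor of u exhibits the conflict;
    # this sketch picks the nearest one, which is y itself.
    b = next(z for z in ancestors(y) if z.is_txn and z not in anc_u)
    return "conflict with " + b.name
```

For two sibling transactions T1 and T2 under a common root, a write recorded under T1 conflicts with an access under T2 only while T1 is still pending; once T1 commits or aborts, the same query reports no conflict.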
To maintain the most recent successful write (and reads), facilitating the necessary conflict
queries, CWSTM uses a lazy access stack. The structure of the lazy access stack is somewhat
different from the simple access stack given in Section 9.4. An object ℓ's lazy stack stores (possibly
complete) transactions that have written to ℓ and sets of transactions that have read from ℓ, but now
these stack entries are ordered chronologically by access. The top of the stack holds the last writer
or the last readers. We have the invariant that if a transaction X on the stack has aborted, then
all transactions located above X on the stack (later chronologically) also have aborted ancestors,
and thus represent deprecated values. The main difference in maintenance is that the lazy access
stack is not updated on transactional commit (thus ignoring the merge of a transaction's readset and
writeset into its parent's). On memory operations, new transactions are added to the access stack in
the same way as described in Section 9.4.
Figure 9.7 gives pseudocode for an instrumentation of each memory access, assuming for
simplicity that all memory accesses behave as write instructions.3 Incorporating readers into the
access stacks is more complicated, but conceptually similar. If a memory access u does not belong
to an aborting transaction, then it is allowed to proceed. First, we test for conflict with the last
writer in lines 4-5. If the last writer has aborted (or has an aborted ancestor), handled in lines 6-9,
then the access stack should be cleaned up by calling CLEANUP. (This auxiliary procedure, given
in Figure 9.8, rolls back the value of the topmost aborted transaction on ℓ's access stack.) Since
there is no new conflict, after CLEANUP, the access should be retried. If, on the other hand, there
is a conflict between u and an active transaction (lines 10-16), then either xparent[u] must abort
or the conflicting transaction (B) must abort. Finally, if there are no conflicts, then the access is
successful. The access stack is updated as necessary (lines 18-20), and the access is performed.
Note that while u is running the ACCESS method, concurrent transactions (that access ℓ) can
continue to commit or abort. The commit or abort of such a transaction can eliminate a conflict
with u, but never create a new conflict with u. Thus, concurrent changes may introduce spurious
aborts, but do not affect correctness.
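The control flow of ACCESS and CLEANUP can be exercised with a single-threaded, writes-only sketch. This is an illustration under stated assumptions, not the CWSTM implementation: locking is omitted entirely, the oracle is the naive parent-pointer walk rather than XConflict, and the names (Txn, memory, stacks, abort_self) are hypothetical.

```python
class Txn:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.status = name, parent, "PENDING"
        self.log = {}                  # location -> value before this txn's first write


def ancestors(t):
    while t is not None:
        yield t
        t = t.parent


def oracle(x, z):
    """Naive conflict test between last writer x and accessing transaction z."""
    chain = list(ancestors(x))
    if any(t.status == "ABORTED" for t in chain):
        return ("aborted", None)
    y = next((t for t in chain if t.status != "COMMITTED"), None)  # nearest active
    if y is None or y in set(ancestors(z)):
        return ("ok", None)
    return ("conflict", y)


memory = {}                            # location -> current value
stacks = {}                            # location -> lazy access stack, oldest first


def cleanup(loc):
    """Roll back the topmost writer if it has an aborted ancestor."""
    stack = stacks[loc]
    x = stack[-1]
    if any(t.status == "ABORTED" for t in ancestors(x)):
        memory[loc] = x.log.pop(loc)   # restore the value that x overwrote
        stack.pop()


def access(z, loc, value, abort_self=True):
    if z.status == "PENDING_ABORT":
        return "XABORT"
    stack = stacks.setdefault(loc, [])
    if stack:
        kind, b = oracle(stack[-1], z)
        if kind == "aborted":
            cleanup(loc)
            return "RETRY"
        if kind == "conflict":
            if abort_self:
                return "XABORT"
            b.status = "PENDING_ABORT" # signal an abort of B
            return "RETRY"
    if not stack or stack[-1] is not z:
        z.log[loc] = memory.get(loc)   # log the old value on z's first access
        stack.append(z)
    memory[loc] = value                # actually perform the write
    return "SUCCESS"
```

Because commits never touch the stacks in this scheme, a caller simply repeats access while it returns RETRY; each retry removes at most one stale entry from the top of the stack.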
The CWSTM scheduler
XConflict relies on a Cilk-like work-stealing scheduler for efficiency and correctness. The main
idea of work stealing is that when a processor completes its own work, it "steals" work from
a different victim processor. Conceptually, the entire (unexpanded) computation tree is initially
"owned" by a single processor. A processor traverses the current subtree that it owns, and only that
subtree. As this processor "discovers" P-nodes, it executes one of its nontransactional
S-node children, and the other child can subsequently be stolen by a thief processor. Whenever a
processor p_i has no work (does not own a subtree), it steals a subtree T rooted at such a
nontransactional S-node. Thus, p_i now owns and traverses the subtree T.
When we say a work-stealing scheduler is "Cilk-like," as required by XConflict, we mean that
it has the following two properties. First, a processor executes its computation subtree in a
left-to-right fashion. Second, whenever a thief processor steals work from a victim processor, it steals the
right subtree from the highest P-node in the victim's subtree that has work available.
3 It is possible to reduce locking on the access stack, but we do not describe that optimization in this paper.
ACCESS(u, ℓ)
 1  Z ← xparent[u]
 2  if status[Z] = PENDING_ABORT return XABORT
    ▷ Otherwise Z is active
 3  accessStack(ℓ).LOCK()
    ▷ Set X to be the last writer.
 4  X ← accessStack(ℓ).TOP()
 5  result ← XCONFLICT-ORACLE(X, u)
 6  if result is "no conflict: X aborted"
 7     then accessStack(ℓ).UNLOCK()
 8          CLEANUP(ℓ)                  ▷ Rollback some values
 9          return RETRY                ▷ The access should be retried
10  if result indicates a conflict with transaction B
11     then if choose to abort self
12             then accessStack(ℓ).UNLOCK()
13                  return XABORT
14             else accessStack(ℓ).UNLOCK()
15                  signal an abort of B
16                  return RETRY
    ▷ Otherwise, there is no conflict: X is an ancestor of Z
17  if Z ≠ X
18     then ▷ Log the access            ▷ Z's first access to ℓ
19          LOGVALUE(Z, ℓ)
20          accessStack(ℓ).PUSH(Z)
    ▷ Actually perform the write operation
21  Perform the write
22  accessStack(ℓ).UNLOCK()
23  return SUCCESS
Figure 9.7: Pseudocode instrumenting an access by u to an object ℓ, assuming that all accesses are writes.
ACCESS(u, ℓ) returns XABORT if Z should abort, RETRY if the access should be retried, or SUCCESS if the
memory operation succeeded.
CLEANUP(ℓ)
1 accessStack(ℓ).LOCK()
2 X ← accessStack(ℓ).TOP()
3 if ∃Z ∈ ances(X) such that status[Z] = ABORTED
4    then RESTOREVALUE(X, ℓ)      ▷ Restore ℓ from X's log
5         accessStack(ℓ).POP()
6 accessStack(ℓ).UNLOCK()
Figure 9.8: Code for cleaning up an aborted transaction from the top of accessStack(ℓ), assuming all
accesses are writes. If the last writer has an aborted ancestor, it should be rolled back.
9.6 CWSTM Conflict Detection
This section describes the high-level XConflict scheme for conflict detection in CWSTM. As the
computation tree dynamically unfolds during an execution, our algorithm dynamically divides the
computation tree into "traces," where each trace consists of memory operations (and internal nodes)
that execute on the same processor. Our algorithm uses several data structures that organize either
traces, or nodes and transactions contained in a single trace. This section describes traces and gives
a high-level algorithm for conflict detection.
By dividing the computation tree into traces, we reduce the cost of locking on shared data
structures. Updates and queries on a data structure whose elements belong to a single trace are
performed without locks because these updates are performed by a single processor. Data structures
whose elements are traces also support queries in constant time without locks. These data structures
are, however, shared among all processors, and therefore require a global lock on updates. Since
traces are created only on steals, however, we can bound the number of traces by O(pT∞), the
number of steals performed by the Cilk-like work-stealing runtime system. Therefore, the number
of updates on these data structures can be bounded similarly.
The technique of splitting the computation into traces and having two types of data structures
("global" data structures whose elements are traces and "local" data structures whose elements
belong to a single trace) appears in Bender et al.'s [BFGL04] SP-hybrid algorithm for series-parallel
maintenance (later improved in [Fin05]). Our traces differ slightly, and our data structures are a
little more complicated, but the analysis technique is similar.
Trace definition and properties
XConflict assigns computation-tree nodes to traces in essentially the same fashion as the
SP-hybrid data structure described in [BFGL04, Fin05]. We briefly describe the structure of traces
here. Since our computation tree has a slightly different canonical form from the canonical Cilk
parse tree used for SP-hybrid, XConflict simplifies the trace structure slightly by merging some
traces together.
Formally, each trace U is a disjoint subset of nodes of the (a posteriori) computation tree. We
let Q denote the set of all traces. Q partitions the nodes of the computation tree C. Initially, the
entire computation belongs to a single trace. As the program executes, traces dynamically split into
multiple traces whenever steals occur.
Figure 9.9: Traces of a computation tree (a) before and (b) after a steal action. Before the steal, only
one processor is executing the subtree, but S2 and S6 are ready. After the steal, the subtree rooted at the
highest ready S-node (S2) is executed by the thief. The subtree rooted at S1, on the other hand, is still owned
and executed by the victim processor.
A trace itself executes on a single processor in a depth-first manner. Whenever a steal occurs
and a processor steals the right subtree of a P-node P ∈ U, the trace U splits into three traces U0,
U1, and U2 (i.e., Q ← Q ∪ {U0, U1, U2} − {U}). The left and right subtrees of P become
traces U1 and U2, respectively. The trace U0 consists of those nodes remaining after P's subtrees
are removed from U. Notice that although the processor performing the steal begins work on only
the right subtree of P, both subtrees become new traces. Figure 9.9 gives an example of traces
resulting from a steal. The left and right children of the highest uncompleted P-node P1 (both these
nodes are nontransactional S-nodes in our canonical tree) are the roots of two new traces, U1 and
U2.
Traces in CWSTM satisfy the following properties.
Property 1 Every trace U ∈ Q has a well-defined head nontransactional S-node S = head[U] ∈
U such that for all nodes B ∈ U, we have S ∈ ances(B).
For a trace U ∈ Q, we use xparent[U] as a shorthand for xparent[head[U]]. We similarly define
nsParent[U].
Property 2 The computation-tree nodes of a trace U ∈ Q form a tree rooted at S = head[U].
Property 3 Trace boundaries occur at P-nodes; either both children of the P-node and the node
itself belong to different traces, or all three nodes belong to the same trace. All children of an
S-node, however, belong to the same trace.
Property 4 Trace boundaries occur at "highest" P-nodes. That is, suppose a P-node P has a
stolen child (i.e., P and its children belong to different traces). Consider all ancestor P-nodes P'
of P such that P is in the left subtree of P'. Then P' must have a stolen child (i.e., P and ancestor
P' belong to different traces).
The last property follows from the Cilk-like work stealing.
The partition Q of nodes in the computation tree C induces a tree of traces J(C) as follows.
For any traces U, U' ∈ Q, there is an edge (U, U') ∈ J(C) if and only if parent[head[U']] ∈ U.4
The properties of traces and the fact that traces partition C into disjoint subtrees together imply that
J(C) is also a tree.
We say that a trace U is active if and only if head[U] is active. The following lemma states that
if a descendant trace U' is active, then U' is a descendant of all active nodes in U. The proof relies
on the fact that traces execute serially in a depth-first (or equivalently, left-to-right) manner.
Lemma 9.2 Consider active traces U, U' ∈ Q, with U ≠ U'. Let D ∈ U' be an active node, and
suppose D ∈ desc(head[U]) (i.e., U' is a descendant trace of U). Then for any active node B ∈ U,
we have B ∈ ances(D).
PROOF. Since traces execute on a single processor in a depth-first manner, only a single
head-to-leaf path of each trace can be active. Thus, if a descendant trace U' is active, it must be the
descendant of some node along that path. In particular, we claim that U' is a descendant of the
leaf, and hence it is a descendant of all active nodes as the lemma states. This claim follows
from Property 3, because both children of the active P-node on the trace boundary must belong to
different traces. □
XConflict algorithm
Recall that CWSTM instruments memory accesses, testing for conflicts on each memory access
by performing queries of XConflict data structures. In particular, XConflict must test whether a
recorded access by node B conflicts with the current access by node u. Suppose that B does not
have an aborted ancestor. Then recall that Definition 37 states that a conflict occurs if and only if
the nearest uncommitted transactional ancestor of B is not an ancestor of u.
A straightforward algorithm (given in Figure 9.6) for conflict detection finds the nearest
uncommitted transactional ancestor of B and determines whether this node is an ancestor of u.
Maintaining such a data structure subject to parallel updates is costly (in terms of locking overheads).
XConflict performs a slightly simpler query that takes advantage of traces. XConflict does not
explicitly find the nearest uncommitted transactional ancestor of B; it does, however, still
determine whether that transaction is an ancestor of u. In particular, let Z be the nearest uncommitted
transactional ancestor of B, and let Uz be the trace that contains Z. Then XConflict finds Uz
(without necessarily finding Z). Testing whether Uz is an ancestor of u is sufficient to determine
whether Z is an ancestor of u. Note that XConflict does not lock on any queries. Many of the
subroutines (described in later sections) need only perform simple ABA tests to see if anything
changed between the start and end of the query.
4 The function parent[] refers to the parent in the computation tree C, not in the trace tree J(C).
The XCONFLICT algorithm is given by pseudocode in Figure 9.10. Lines 1-4 handle the simple
base cases. If B and u belong to the same trace, they are executed by a single processor, so there is
no conflict. If B is aborted, there is also no conflict.
Suppose B is not aborted and that B and u belong to different traces. XCONFLICT first finds
X, the nearest transactional ancestor of B that belongs to an active trace, in line 5. The possible
locations of X in the computation tree are shown in Figure 9.12. Let Ux = trace(X). Notice
that Ux is active, but X may be active or inactive. For cases (a) or (b), we find X with a simple
lookup of xparent[B]. Case (c) involves first finding U, the highest completed ancestor trace of
trace(B), then performing a simple lookup of xparent[U]. Section 9.9 describes how to find the
highest completed ancestor trace.
Line 9 finds Y, the highest active transaction in Ux. If Y exists and is an ancestor of X, as
shown in the left of Figure 9.13, then XCONFLICT is in the case given by lines 11-13. If Ux is an
ancestor of u, we conclude that B has committed to an ancestor of u. Figure 9.13 (a) and (b) show
the possible scenarios where Ux is an ancestor of u: either X is an ancestor of u, or X has committed
to some transaction Z that is an ancestor of u.
Suppose instead that Y is not an ancestor of X (or that Y does not exist), as shown in the left
of Figure 9.14. Then XCONFLICT follows the case given in lines 15-17. Let Z be the transactional
parent of Ux. Since X has no active transactional ancestor in Ux, it follows that X has committed
to Z. Thus, if trace(Z) is an ancestor of u, we conclude that B has committed to an ancestor of u,
as shown in Figure 9.14.
Section 9.7 describes how to find the trace containing a particular computation-tree node (i.e.,
computing trace(B)). Section 9.8 describes how to maintain the highest active transaction of any
trace (used in line 9). Section 9.9 describes how to find the highest completed ancestor trace of a
trace (used for line 5), or find an aborted ancestor trace (line 3). Computing the transactional parent
of any node in the computation tree (xparent [B]) is trivial. Section 9.10 describes a data structure
for performing ancestor queries within a trace (line 10), and a data structure for performing ancestor
queries between traces (lines 11 and 15).
The following theorem states that XConflict is correct.
Theorem 9.3 Let B be a node in the computation tree, and let u be a currently executing memory
access. Suppose that B does not have an aborted ancestor. Then XCONFLICT(B, u) reports a
conflict if and only if the nearest (deepest) active transactional ancestor of B is not an ancestor of u.
PROOF. If B has an aborted ancestor, then XCONFLICT properly returns no conflict.
Let Z be the nearest active transactional ancestor of B. Let Uz be the trace containing Z; since
Z is active, Uz is active. Lemma 9.2 states that Uz is an ancestor of u if and only if Z is an ancestor
of u. It remains to show that XConflict finds Uz.
XConflict first finds X, the nearest transactional ancestor of B belonging to an active trace
(line 5). The nearest active ancestor of B must be X or an ancestor of X. Let Ux be the trace
containing X, and let Y be the highest active transaction in Ux. If Y is an ancestor of X, then
either Z = X, or Z is an ancestor of X and a descendant of Y (as shown in Figure 9.13). Thus,
XConflict performs the correct test in lines 11-13.
Suppose instead that Y, the highest active transaction in Ux, is not an ancestor of X. Then no
active transaction in Ux is an ancestor of X. Let Z be the transactional parent of Ux (i.e.,
Z = xparent[Ux]). Since Ux is active, Z must be active. Thus, Z is the nearest uncommitted
transactional ancestor of B, and XConflict performs the correct test in lines 15-17.
XCONFLICT(B, u)
   ▷ For any computation-tree node B and any active memory operation u
   ▷ Test for simple base cases
 1 if trace(B) = trace(u)
 2    then return "no conflict"
 3 if some ancestor transaction of B is aborted
 4    then return "no conflict: B aborted"
 5 Let X be the nearest transactional ancestor of B belonging to an active trace.
 6 if X = null                          ▷ committed at top level
 7    then return "no conflict: B committed to root"
 8 Ux ← trace(X)
 9 Let Y be the highest active transaction in Ux
10 if Y ≠ null and Y is an ancestor of X
11    then if Ux is an ancestor of u
12            then return "no conflict: B committed to u's ancestor"
13            else return "conflict with Y"
14    else Z ← xparent[Ux]
15         if Z = null or trace(Z) is an ancestor of u
16            then return "no conflict: B committed to u's ancestor"
17            else return "conflict with Z"
Figure 9.10: Pseudocode for the XConflict algorithm.
[Figure 9.11 legend. Edge shape: no active xaction on path; active xactions on path. Edge style: no
traces on path; active traces only; complete traces only.]
Figure 9.11: The definition of arrows used to represent paths in Figures 9.12, 9.13 and 9.14.
Figure 9.12: The three possible scenarios in which X is the nearest transactional ancestor of B that belongs
to an active trace. Arrows represent paths between nodes (i.e., many nodes are omitted): see Figures 9.4
and 9.11 for definitions. In both (a) and (b), B belongs to an active trace. In (a), xparent[B] belongs to the
same active trace as B. In (b), xparent[B] belongs to an ancestor trace of trace(B). In (c), B belongs to a
complete trace, U is the highest completed ancestor trace of B, and X is xparent[U].
Figure 9.13: The possible scenarios in which the highest active transaction Y in Ux is an ancestor of X,
and Ux is an ancestor of u (i.e., line 11 of Figure 9.10 returns true). Arrows represent paths between nodes
(i.e., many nodes are omitted): see Figures 9.4 and 9.11 for definitions. The block arrow shows implication
from the left side to either (a) or (b).
Figure 9.14: The scenario in which the highest active transaction Y in Ux is not an ancestor of X, and
Z = xparent[Ux] is an ancestor of u (i.e., line 15 of Figure 9.10 returns true). The block arrow shows
implication from the left side to the situation on the right.
In the above explanation, we assume that no XConflict data-structural changes occur concur-
rently with a query. The case of concurrent updates is a bit more complicated and omitted from
this proof. The main idea for proving correctness subject to concurrent updates is as follows. Even
when trace splits occur, if a conflict exists, XCONFLICT has pointers to traces that exhibit the con-
flict. Similarly, if XCONFLICT acquires pointers to a transaction (Y or Z) deemed to be active, that
transaction was active at the start of the XCONFLICT execution.
Note that XCONFLICT may return some spurious conflicts if transactions complete during the
course of a query.
9.7 Trace Maintenance
This section describes how to maintain trace membership for each node B in the computation tree,
subject to queries trace(B). The queries take O(1) time in the worst case. We give the main idea
of the scheme here for completeness, but we omit details as they are similar to the local tier of
SP-hybrid [BFGL04, Fin05].
To support trace membership queries, XConflict organizes computation-tree nodes belonging to
a single trace as follows. Nodes are associated with their nearest nontransactional S-node ancestor.
These S-nodes are grouped into sets, called "trace bags." To be more precise, for each
nontransactional S-node S ∈ nsNodes(C), XConflict creates a bag called FBag(S) with the following property.
Let nsSet(S) = {S' ∈ nsNodes(S) : nsParent[S'] = S} be the set of all nearest nontransactional
S-node descendants of S. Then for any S' ∈ nsSet(S) ∩ trace(U), we have FBag(S) = FBag(S')
if and only if S' has completed. In other words, FBag(S) contains all the descendant S-nodes of
procedures that have returned to S.
Each bag b has a pointer to a trace, denoted traceField[b], which must be maintained
efficiently. A trace may contain many trace bags. Bags are merged dynamically in a way similar to
the SP-bags [FL97] in the local tier of SP-hybrid [BFGL04, Fin05] using a disjoint-sets data
structure [CLRS01, Chapter 21]. XConflict UNIONs bags when nontransactional S-nodes complete.
That is, when S' completes, we perform UNION(S, S') if and only if S and S' still belong to the same
trace (i.e., if a steal has not occurred). Since traces execute on a single processor, we do not lock
the data structure on update (UNION) operations. The difference in our setting is that we use only
one kind of bag (instead of two in SP-bags).
When steals occur, a global lock is acquired, and then a trace is split into multiple traces, as
in the global tier of SP-hybrid [BFGL04, Fin05]. The difference in our setting is that traces split
into three traces (instead of five in SP-hybrid). It turns out that trace splits can be done in O(1)
worst-case time by simply moving a constant number of bags. When the constant-time trace split
completes (including the split work in Sections 9.8 and 9.10), the global lock is released. Consider
the bags at the time a node S2 is stolen from the trace U with head[U] = S, and let S1 be S2's
left sibling (the parent is a P-node). Three new traces U0, U1, and U2 are created, rooted at
S, S1, and S2, respectively. Since CWSTM executes in a depth-first manner and performs steals
from the highest P-node, it follows that all descendant nodes of S belonging to U that are not
descendants of S1 (or S2) still belong to the bag FBag(S). Moreover, these are exactly the nodes
that belong to the resulting trace U0. Thus, moving nodes to U0 is simply a matter of updating
traceField[FBag(S)] ← U0. Since S2 is only now beginning to execute, its trace is initially
empty, and we simply perform MAKE-SET(FBag(S2)) and traceField[FBag(S2)] ← U2. There
may be many bags residing in U1, but these bags all previously belonged to U, so we can simply
rename U as U1. Thus, updating trace bags after a steal requires only O(1) pointer updates.
To query what trace a node B belongs to, we perform the operations traceField[FIND-BAG(nsParent[B])].
These queries (in particular, FIND-BAG) take O(1) worst-case time as in SP-hybrid [BFGL04,
Fin05]. Merging bags uses a UNION operation and takes O(1) amortized time, but an
optimization [Fin05] gives a technique that improves UNIONs to worst-case O(1) time whenever the
amortization might adversely increase the program's critical path.
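The bag bookkeeping described in this section can be sketched as follows. The sketch makes several assumptions that are not from the thesis: it uses an ordinary union-by-rank disjoint-set forest with path halving (so FIND-BAG is near-constant amortized rather than worst-case O(1)), it takes no locks, and the names Trace, Bag, SNode, complete, and steal are illustrative.

```python
class Trace:
    def __init__(self, name):
        self.name = name


class Bag:
    """Disjoint-set element; the set representative carries trace_field."""
    def __init__(self, trace):
        self.parent = self
        self.rank = 0
        self.trace_field = trace


def find_bag(b):
    while b.parent is not b:
        b.parent = b.parent.parent     # path halving
        b = b.parent
    return b


def union(a, b):
    a, b = find_bag(a), find_bag(b)
    if a is b:
        return a
    if a.rank < b.rank:
        a, b = b, a
    b.parent = a
    if a.rank == b.rank:
        a.rank += 1
    return a


class SNode:
    """A nontransactional S-node together with its bag FBag(S)."""
    def __init__(self, name, trace):
        self.name = name
        self.bag = Bag(trace)          # MAKE-SET(FBag(S))


def trace_of(s):
    return find_bag(s.bag).trace_field


def complete(s, s_child):
    """S' completes: merge its bag into S's bag unless a steal separated them."""
    if trace_of(s) is trace_of(s_child):
        union(s.bag, s_child.bag)


def steal(head, stolen_name):
    """Split the trace U headed at `head` when an S-node is stolen from it."""
    u = trace_of(head)
    find_bag(head.bag).trace_field = Trace("U0")   # FBag(S) now belongs to U0
    u.name = "U1"                                  # remaining bags: rename U as U1
    return SNode(stolen_name, Trace("U2"))         # fresh bag for the stolen node
```

The split touches only the representative of FBag(S), the old trace object, and one new bag, which is the constant number of pointer updates that the text relies on.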
9.8 Highest Active Transaction
This section describes how XConflict finds the highest active transaction in a trace in O(1) time;
this query is used in line 9 of Figure 9.10.
For each nontransactional S-node S, we have a field nextx[S] that stores a pointer to the nearest
active descendant transaction of S. Maintaining this field for all S-nodes is expensive, so instead
we maintain it only for some S-nodes as follows. Let S ∈ U be an active nontransactional S-node
such that either S = head[U], or S is the left child of a P-node and S's nearest S-node ancestor
(which is always a grandparent) is a transaction. Then nextx[S] is defined to be the nearest active
descendant transaction of S in U. Otherwise, nextx[S] = null.
Finding the highest active transaction simply entails a lookup of nextx[head[U]], which takes O(1)
time. The complication is maintaining the nextx values, especially subject to dynamic trace splits.
To maintain nextx, we keep a stack of S-nodes in U for which nextx is defined. Initially, push
head[U] onto the stack. For each of the following scenarios, let S be the S-node on the top of the
stack. Whenever encountering a transactional S-node X, check nextx[S]. If nextx[S] = null,
then set nextx[S] ← X. Otherwise, do nothing. Whenever completing a transaction X, check
nextx[S]. If nextx[S] = X, then set nextx[S] ← null. Otherwise, do nothing. Whenever
encountering a nontransactional S-node S', check nextx[S]. If nextx[S] = null, do nothing.
Otherwise, push S' onto the stack. Whenever completing a nontransactional S-node S', pop S' from
the stack if it is on top of the stack.
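The four maintenance rules translate directly into code. The following is a minimal sketch for a single trace with no splits; the class names and the enter/leave interface are assumptions for illustration, and the trace-split handling described next is not modeled.

```python
class SNode:
    def __init__(self, name, transactional=False):
        self.name = name
        self.transactional = transactional
        self.nextx = None              # nearest active descendant transaction, if tracked


class TraceWalker:
    """Maintains nextx for one trace as its processor walks the tree depth-first."""
    def __init__(self, head):
        self.head = head
        self.stack = [head]            # S-nodes for which nextx is defined

    def enter(self, node):
        top = self.stack[-1]
        if node.transactional:
            if top.nextx is None:      # first active transaction below top
                top.nextx = node
        elif top.nextx is not None:    # nontransactional S-node under an active txn
            self.stack.append(node)

    def leave(self, node):
        top = self.stack[-1] if self.stack else None
        if top is None:
            return
        if node.transactional:
            if top.nextx is node:
                top.nextx = None
        elif top is node:
            self.stack.pop()

    def highest_active_transaction(self):
        return self.head.nextx         # nextx[head[U]], an O(1) lookup
```

Each enter or leave call does a constant amount of work, so the bookkeeping adds only constant overhead per S-node visited.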
Finally, XConflict maintains these nextx values even subject to trace splits. Consider a split of
trace U into three traces U0, U1, and U2, rooted at S, S1, and S2, respectively. Since CWSTM steals
from the highest P-node in the computation tree, S1 must be the highest active nontransactional
S-node descendant of S that is the left child of a P-node. Thus, either S1 is the second S-node on
U's stack, or S1 is not on U's stack.
If S1 is on U's stack, then nextx[S] is defined to be an ancestor of S1, and we leave it as such.
Moreover, since S1 is on the stack, nextx[S1] is defined appropriately. Simply split the stack into
two just below S to adjust the data structure to the new traces. Suppose instead that S1 is not on U's
stack. Then nextx[S] may be a descendant of S1 (or it is undefined). Set nextx[S1] ← nextx[S]
and nextx[S] ← null. Then split the stack below S, and prepend S1 at the top of its stack. The
necessary stack splitting takes O(1) worst-case time. This splitting occurs while holding the global
lock acquired during the steal (as in Section 9.7).
9.9 Supertraces
This section describes XConflict's data structure to find the highest completed ancestor trace of a
given trace (used as a subprocedure for line 5 in Figure 9.10, illustrated by U in Figure 9.12 (c)).
To facilitate these queries, XConflict groups traces together into "supertraces." Grouping traces
into supertraces also facilitates faster aborts: when aborting a transaction in trace U, we need only
abort some of the supertrace children of U, not the entire subtree in C. This section also provides
some details on performing the abort.
All update operations on supertraces take place while holding the same global lock acquired
during the steal (as in Sections 9.7, 9.8, and 9.10). Note that unlike the data structures in Sections
9.7, 9.8, and 9.10, the updates to supertraces do not occur when steals occur. To prove good
performance (in Section 9.11), we use the fact that the number of supertrace-update operations is
asymptotically identical to the number of steals. This amortization is similar to the "global tier" of
SP-hybrid [BFGL04].
At any point during program execution, a completed trace U ∈ Q belongs to a supertrace
K = strace(U) ⊆ Q. In particular, the traces in K form a tree rooted at some representative
trace rep [K], which is an ancestor of all traces in K. Our structure of supertraces is such that either
rep [strace(U)] is the highest completed ancestor trace of U (i.e., as used by line 5 in Figure 9.10),
or U has an aborted ancestor. We prove this claim in Lemma 9.4 after describing how to maintain
supertraces.
Supertraces are implemented using a disjoint-sets data structure [CLRS01, Chapter 21]. In
particular, we use Gabow and Tarjan's data structure that supports MAKE-SET, FIND (implementing
strace(U)), and UNION operations, all in O(1) amortized time when unions are restricted to a tree
structure (as they are in our case).
When a trace U is created, we create an empty supertrace for U (so strace(U) = ∅). When
the trace completes (i.e., at a join operation), we acquire the global lock. We then add U to U's
supertrace (giving strace(U) = {U}). Next, we consider all child traces U' of U (in the tree of
traces J(C)).⁵ If head[U'] is ABORTED, then we skip U'. If head[U'] is COMMITTED, we merge the
two supertraces with UNION(strace(U), strace(U')). Thus, for U' (and all relevant descendants),
rep[strace(U')] = rep[strace(U)] = U. Once these updates complete, the global lock is released.
Later, U's supertrace may be merged with its parent's, thereby updating rep[strace(U)].
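The completion-time update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain union-find with path halving rather than Gabow and Tarjan's structure, it assumes every child trace has already completed, and the names Supertraces, complete, and representative are hypothetical.

```python
class Supertraces:
    """Union-find sketch of supertraces (not the Gabow-Tarjan structure)."""

    def __init__(self):
        self.parent = {}  # union-find parent pointers over completed traces
        self.rep = {}     # representative trace, stored at each set root

    def find(self, trace):
        # Standard FIND with path halving.
        while self.parent[trace] != trace:
            self.parent[trace] = self.parent[self.parent[trace]]
            trace = self.parent[trace]
        return trace

    def complete(self, trace, children, head_status):
        """Called when `trace` completes; the global lock is assumed held."""
        self.parent[trace] = trace       # strace(U) = {U}
        self.rep[trace] = trace
        for child in children:
            if head_status[child] == "ABORTED":
                continue                 # aborted children are skipped
            root_u, root_c = self.find(trace), self.find(child)
            self.parent[root_c] = root_u  # UNION(strace(U), strace(U'))
            self.rep[root_u] = trace      # U represents the merged set

    def representative(self, trace):
        return self.rep[self.find(trace)]
```

In this sketch, a committed child's representative becomes its parent trace once the parent completes, while an aborted child keeps itself as representative, matching the two cases the text distinguishes.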
A naive algorithm to abort a transaction X must walk the entire computation subtree rooted
at X, changing all of X's COMMITTED descendants to ABORTED. Instead, we only walk the sub-
tree rooted at X in U, not C. Whenever hitting a trace boundary (i.e., B ∈ U, D ∈ children(B),
D ∈ U' ≠ U), we set the root of the child trace (D) to be aborted and do not continue into its
descendants. Thus, we enforce that all descendants of B have a supertrace with an aborted representative.
The following lemma states that either the representative of U's supertrace is the highest com-
pleted ancestor trace of U, or U has an aborted ancestor.
Lemma 9.4 For any completed trace U ∈ Q, let K = strace(U), and let U' = rep[K]. Exactly
one of the following cases holds.
1. Either head[U'] is ABORTED, or
2. head[U'] is COMMITTED, trace(parent[head[U']]) is active, and there is no ABORTED
transaction between head[U] and head[U'].
⁵ Maintaining a list of all child traces is not difficult. We keep a linked list for each node in the trace tree and add to
it whenever a trace splits.
PROOF. We claim also that U' is an ancestor of all traces in K. We prove this claim and the lemma
inductively on the (tree of traces) height of the highest committed ancestor of U.
For a base case, consider the moment when U completes. Then K contains only descendants
of U, and rep [K] = U satisfying our claims.
Suppose that U's highest committed ancestor occurs at height h in the tree of traces, and assume
that U' = rep[K] satisfies the lemma. Let V be the height-(h + 1) ancestor of U. Then we show that
the lemma is maintained when V commits. We divide the proof into the two cases.
Suppose head[U'] is ABORTED. Since U' is an ancestor of all traces in K, the only time U'
can be merged with another trace is when its parent completes. ABORTED children, however, are
not unioned. Thus, once the representative U' of a supertrace is aborted, the supertrace does not
change.
Suppose instead that head[U'] is COMMITTED. Then when V commits, XConflict unions the
two supertraces for U' and V, and then sets that supertrace's representative to V. Thus, rep [strace(U)]
is the highest completed ancestor trace of U. This trace may be COMMITTED or ABORTED, but
either satisfies the claim. □
9.10 Ancestor Queries
This section describes how XConflict performs ancestor queries. XConflict performs a "local"
ancestor query of two nodes belonging to the same trace (line 10 of Figure 9.10) and a "global"
ancestor query of two different traces (lines 11 and 15 of Figure 9.10). Both of these queries can
be performed in O(1) worst-case time. The global lock is acquired only on updates to the global
data structure, which occurs on trace splits (i.e., steals).
Local ancestor queries
CWSTM executes a trace on a single processor, and each trace is executed in depth-first (e.g.,
left-to-right) order. We thus view a trace execution as a depth-first execution of a computation (sub)tree
(or a depth-first tree walk).
To perform ancestor queries on a depth-first walk of a tree, we associate with each tree node u
the discovery time d[u], indicating when u is first visited (i.e., before visiting any of u's children),
and the finish time f[u], indicating when u is last visited (i.e., when all of u's descendants have
finished). (This same labeling appears in depth-first search in [CLRS01, Section 22.3].) These
timestamps are sufficient to perform ancestor queries in constant time. To mark these times, we
increment a counter each time a node is labeled.
To query whether a node u is an ancestor of a node v, we simply need to compare the discovery
and finish times. The following lemma states a well known fact about these timestamps induced by
depth-first search.
Lemma 9.5 Consider a depth-first tree walk, and let d[u] and f[u] be the discovery and finish
times, respectively, of a node u. Then u is an ancestor of v if and only if d[u] < d[v] and f[u] > f[v].
□
In the setting of XConflict, we need to perform these local ancestor queries while traces split
dynamically. This dynamic split of traces, however, does not significantly affect the timestamps
associated with each node within a trace. In particular, consider when a trace U splits into three
traces U, U1, and U2. There is initially a (time) counter associated with U. After the split, we copy
the value of this counter to the counter associated with each of the resulting traces U, U1, and U2.
Thus, the relative discovery and finish times between nodes in a single trace remain correct.
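As an illustration of the timestamp scheme, the following Python sketch labels a small tree with discovery and finish times from a single counter and then answers ancestor queries by comparing them. The dictionary-based tree representation and the function names are assumptions made for this example; trace splitting is not modeled.

```python
def label_tree(children, root):
    """Assign discovery/finish times with one shared counter (iterative DFS)."""
    d, f, clock = {}, {}, 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        clock += 1                      # the counter advances on every label
        if done:
            f[node] = clock             # all descendants have finished
            continue
        d[node] = clock                 # first visit, before any child
        stack.append((node, True))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return d, f


def is_ancestor(d, f, u, v):
    """Timestamp test of Lemma 9.5; a node is not its own ancestor here."""
    return d[u] < d[v] and f[u] > f[v]
```

Each query reads four stored integers, which is why the local ancestor query costs constant time once the labels exist.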
Global ancestor queries
Since the computation tree does not execute in a depth-first manner, the same discovery/finish time
approach does not work for ancestor queries between traces. Instead, we keep two total orders
on the traces dynamically using order-maintenance data structures [DS87, BCD+02]. These two
orders give us enough information to query the ancestor-descendant relationship between two nodes
in the tree of traces. These total orders are updated while holding the global lock acquired during
the steal, as in Sections 9.7 and 9.8. Since our global ancestor-query data structure resembles the
global series-parallel-maintenance data structure in SP-hybrid [BFGL04], we omit the details of
the data structure. As in SP-hybrid, each query has a worst-case cost of O(1), and trace splits have
an amortized cost of O(1).
The two orders used are left-to-right order and right-to-left order. In our left-to-right order, a
node U precedes (all the nodes in) the left subtree of U, which precedes (all the nodes in) the right
subtree of U. In other words, we order the nodes (traces) in the order they are output in a preorder
tree walk that visits the children of a node in left-to-right order. In our right-to-left order, a node
U precedes its descendants, but now U's right subtree precedes U's left subtree. This ordering
corresponds to a preorder tree walk that visits the children of a node in right-to-left order. The left-
to-right order here matches the "English" ordering of SP-hybrid. SP-hybrid's "Hebrew" ordering,
however, only matches the right-to-left order at P-nodes.
Let <L denote precedence in the left-to-right order. That is, u <L v means that u precedes
v in the left-to-right order. Similarly, <R denotes precedence in the right-to-left order. Then the
following lemma states how to perform ancestor queries.
Lemma 9.6 Let U and V be two nodes in a tree. Then U is an ancestor of V if and only if U <L V
and U <R V. □
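The two-order test can be checked on a small static tree with the Python sketch below. It computes both preorder rankings from scratch, which is only for illustration; the dynamic maintenance of these orders is not modeled, and the function names and dictionary-based tree are assumptions of this example.

```python
def preorder(children, root, right_to_left=False):
    """Return each node's rank in a preorder walk of the tree."""
    rank, stack = {}, [root]
    while stack:
        node = stack.pop()
        rank[node] = len(rank)
        kids = children.get(node, [])
        # The stack pops the most recently pushed child first, so push
        # the children in the opposite of the desired visiting order.
        stack.extend(kids if right_to_left else reversed(kids))
    return rank


def is_trace_ancestor(left, right, u, v):
    """U is an ancestor of V iff U precedes V in both orders."""
    return left[u] < left[v] and right[u] < right[v]
```

Two siblings, or nodes in sibling subtrees, appear in opposite relative positions in the two orders, so at least one comparison fails for them; only a true ancestor precedes its descendant in both.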
To maintain these orders dynamically, we use two order-maintenance data structures [DS87,
BCD+02], as in SP-hybrid. An order-maintenance data structure supports the following operations:
1. OM-PRECEDES(O, x, y): Return TRUE if x precedes y in the ordering O. Both x and y must
already exist in the ordering.
2. OM-INSERT-BEFORE(O, x, y1, y2, ..., yk): In the ordering O, insert new elements y1, y2, ..., yk,
in that order, immediately before existing element x.
3. OM-INSERT-AFTER(O, x, y1, y2, ..., yk): In the ordering O, insert new elements y1, y2, ..., yk,
in that order, immediately after existing element x.
The OM-PRECEDES operation can be supported in O(1) worst-case time. The OM-INSERT
operations can be performed in O(1) worst-case time for each node inserted.⁶
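To make the interface concrete, here is a deliberately naive Python sketch of an ordering with the three operations. It is backed by a list, so each operation takes time linear in the number of elements; it does not achieve the O(1) bounds of the structures in [DS87, BCD+02]. The class name NaiveOrder and its method names are hypothetical.

```python
class NaiveOrder:
    """List-backed total order: same interface, but O(n) per operation."""

    def __init__(self, first):
        self.items = [first]

    def precedes(self, x, y):
        # OM-PRECEDES: both elements must already be in the ordering.
        return self.items.index(x) < self.items.index(y)

    def insert_before(self, x, *new):
        # OM-INSERT-BEFORE: new elements land, in order, just before x.
        i = self.items.index(x)
        self.items[i:i] = new

    def insert_after(self, x, *new):
        # OM-INSERT-AFTER: new elements land, in order, just after x.
        i = self.items.index(x) + 1
        self.items[i:i] = new
```

A real implementation replaces the list positions with labels that can be compared in constant time and relabels a small neighborhood on insertion.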
XConflict uses order-maintenance data structures OL and OR to maintain the left-to-right and
right-to-left orderings, respectively, subject to dynamically created traces. Since these traces can be
created in parallel by multiple processors, we obtain a global lock during steals. Whenever a trace
U splits into three traces U0, U1, and U2, we obtain the lock and perform insert operations into each
ordering. (The traces U0 and U2 are added, whereas U1 holds everything else that used to be part
of U.) Notice that the traces U0 and U2 are the parent and right sibling, respectively, of the existing
trace U = U1.
One added difficulty is that at the time of the split, the resulting U1 may have several descendant
traces.⁷ To deal easily with the fact that U may have descendants, at the time a trace is created,
we insert placeholder traces to help with future trace inserts. In particular, for each trace U, we
have placeholders U(ℓ) and U(r) (surrounding the region of the orderings containing U and its
descendants). Thus, all descendants of U are inserted between U(ℓ) and U(r), but all other nodes
are inserted outside these boundaries. These placeholders only increase the number of inserted
elements by a constant factor (3), so they have no effect on the asymptotic analysis.
When U splits, we perform the following insert operations. First, to insert the parent and new
parent placeholders, we perform OM-INSERT-BEFORE(OL, U( ) , U) , Uo, U(r)) and OM-INSERT-BEFORE(O
To insert the new right sibling and placeholders, we perform insert operations OM-INSERT-AFTER(OL, U(r), U
and OM-INSERT-BEFORE(OL, U(e), U2( ), U, U,()). Collectively, these inserts insert 12 elements,
for an amortized cost of O(1).
Note that correctness of the global ancestor queries relies on Property 4, the Cilk-like
stealing property that a thief processor steals a subtree from the highest available P-node owned by
the victim.
9.11 Performance Claims
This section bounds the running time of a CWSTM program in the absence of conflicts. The bound
includes the time to check for conflicts assuming that all accesses are writes and to maintain the
relevant data structures. Checking for conflicts with multiple readers, however, increases the run-
time. Additionally, aborts add more work to the computation. Those slowdowns are not included
in the analysis.
The following theorem states the running time of a CWSTM program under nice conditions.
We give bounds for both Cilk's normal randomized work-stealing scheduler, and for a round-robin
work-stealing scheduler (as in [Fin05]).
Theorem 9.7 Consider a CWSTM program with T1 work and critical-path length T∞ in which
all memory accesses are writes. Suppose the program, augmented with XConflict, is executed on P
processors and that no transaction aborts occur.
1. When using a randomized work-stealing scheduler, the program runs in O(T1/P + P(T∞ +
lg(1/ε))) time with probability at least 1 - ε, for any ε > 0.
2. When using a round-robin work-stealing scheduler, the program runs in O(T1/P + PT∞)
worst-case time.
⁶ For our purposes, O(1) amortized is sufficient for inserts. The amortized data structure [DS87, BCD+02] is easily
implementable and still supports O(1) worst-case queries.
⁷ In contrast, the SP-hybrid algorithm essentially splits traces only at leaves in the tree of traces, so there are no
descendant traces to worry about. The main reason for this difference is our different definition of traces.
PROOF. The proof technique here is similar to the proof of performance of SP-hybrid in [Fin05].
The key insight in this analysis technique is to amortize the cost of updates of global-lock-protected
data structures against the number of steals. One important feature of XConflict's "global" data
structures is that they have O(# of traces) total update cost. Another is that whenever a steal attempt
occurs, the processor being stolen from is making progress on the original computation. (That is,
whenever stealable, a processor performs only O(1) additional work for each step of the original
computation.) The proof makes the pessimistic assumption that while the global lock is held, only
the processor holding the lock makes any progress.
In XConflict, an insertion into the global ancestor data structure requires a lock that blocks all
queries to that data structure. However, an insertion into the global ancestor data structure happens
only when traces split, which occurs only on steals. Randomized work-stealing analysis guarantees
that the number of steal attempts for a job with work T1 and span T∞ is O(P(T∞ + lg(1/ε)))
with probability 1 - ε. Therefore, the total number of trace splits is O(P(T∞ + lg(1/ε))) with
probability 1 - ε. We assume worst-case contention and stop all processors when any steal occurs.
Since the total work is T1, the time to complete this work when no processor is stealing is at most
T1/P. Therefore, the total completion time is O(T1/P + P(T∞ + lg(1/ε))). Note that all other
operations on XConflict take a constant amount of time and do not require any locks. □
One way of viewing these bounds is as the overhead of the XConflict algorithm itself. These bounds
nearly match those of a Cilk program without XConflict's conflict detection. The only difference
is that the T∞ term is multiplied by a factor of P. In most cases, we expect PT∞ < T1/P, so these
bounds represent only constant-factor overheads beyond optimal. We would also expect the first
bound (using the randomized scheduler) to have better constants hidden in the big-O.
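A quick numeric check shows how far apart the two terms can be. The values below are purely hypothetical (they are not measurements of any program), and the function name is an assumption; the sketch ignores the constants hidden in the big-O and the lg(1/ε) term.

```python
def xconflict_bound_terms(t1, t_inf, p):
    """Return the two terms of T1/P + P*T_inf, ignoring hidden constants."""
    return t1 / p, p * t_inf


# Hypothetical job: 10^9 units of work, span 10^3, on 16 processors.
work_term, span_term = xconflict_bound_terms(t1=10**9, t_inf=10**3, p=16)
```

For these assumed values the work term is 62,500,000 and the span term is 16,000, so the extra factor of P on the span term changes the sum by well under one percent.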
These XConflict bounds translate to bounds on completion time of a CWSTM program under
optimistic conditions. For illustration, consider a program where all concurrent paths access
disjoint sets of memory. The overhead of maintaining the XConflict data structures is O(T1/P + PT∞).
Each memory access queries the XConflict data structure at most once. Since each query requires
only O(1) time, the entire program runs in O(T1/P + PT∞) time.
The CWSTM design we describe does not provide any reasonable performance guarantees
when we allow multiple readers. There are two reasons for this problem. First, concurrent reads to
an object may contend on the access stack to that object. Second, even in the case where concurrent
read operations never wait to acquire an access stack lock, it appears that a write operation may need
to check for conflicts against potentially many readers in a reader list (some of which may have
already committed). Therefore, a write operation is no longer a constant time operation, and it
seems the work of the computation might increase proportional to the number of parallel readers to
an object. It is part of future work to improve the CWSTM design and analysis in the presence of
multiple readers.
9.12 Discussion
The CWSTM model presented in Section 9.5 describes one approach for implementing a software
transactional memory system that supports transactions with nested fork-join parallelism. The CWSTM
design was guided by a few major goals.
* Supporting nested transactions of unbounded depth.
* Small overhead when there are no aborts.
* Avoid asymptotically increasing the work or the critical path of the computation too much.
We believe that we have achieved these goals to some extent, since CWSTM guarantees provably
good completion time in the case when there are no aborts, and there are no concurrent readers
(or all accesses are treated as writes).
We believe that supporting unbounded nesting depth in transactions is important for compos-
ability of programs. If a function is called from inside a transaction, the caller should not have to
worry about how many transactions are nested inside the function call. However, it may be true that
the common case is transactions of small depth. In this case, a simpler design like the one described
in Section 9.4 might be sufficient in the common case, at the expense of a slowdown in the case
when the nesting depth is large. It is difficult, however, to conclude what the common case is, since
there are currently few examples of programs with nested parallel transactions. An important part
of future research would be to write series-parallel programs with nested transactions to understand
what the common case is, and what one should optimize for.
CWSTM is a TM system design and has not been implemented yet. As described in this paper,
each memory access may potentially require multiple data structure queries. CWSTM also may
have a large memory footprint. Due to lazy cleanup on aborts, and fast commits, the access stack
for an object may grow and require space proportional to the number of accesses to that object.
Also, access stacks may contain pointers to transaction logs that persist long after the transactions
are committed or aborted. Thus, a computation's memory footprint can become quite large. In
practice, implementing a separate, concurrent thread for "garbage-collection" of metadata may
help. As part of future work, we would like to implement the system in the Cilk runtime system to
evaluate its practical performance and explore ways to optimize the implementation.
It would be interesting to see if CWSTM-like mechanisms are useful for high-performance lan-
guages like Fortress [ACH+07] and X10 [ESS05]. Both these languages support transactions and
fork-join parallelism. The language specification for Fortress also permits nested parallel transac-
tions. These are richer languages than Cilk, however, and may require more complicated mecha-
nisms to support nested parallel transactions.
9.13 Related Work
Chapter 10
Conclusions and Future Work
Properly designed concurrency platforms can simplify parallel programming by freeing program-
mers from worrying about the low-level details of parallel programming. The work presented in
this dissertation advances the technology in the design of concurrency platforms, primarily with
respect to scheduling and synchronization for dynamic multithreaded languages. A brief summary
of the contributions is as follows:
* Adaptive Scheduling: Chapters 2 and 3 present automatic schedulers that adapt to jobs'
changes in parallelism. This work provides a basis for building concurrency platforms where
programmers need not specify how many processors must be used to run their program,
thereby unburdening programmers from analyzing the parallelism of the program. In ad-
dition, these schedulers are more effective than nonadaptive schedulers for programs with
irregular parallelism.
* Dag Evaluation: Chapter 4 presents a concurrency platform for dag evaluations in the form
of a Cilk++ library. This library allows the programmer to specify dag evaluations easily and
automatically schedules these evaluations to take full advantage of both the inter-node and
intra-node parallelism of these computations.
* Helper Locks: Chapter 5 presents helper locks that allow programmers to express and exploit
parallelism inside large critical sections. In the context of this thesis, this work relates to both
scheduling and synchronization in concurrency platforms.
* Memory Models for TM and Open Nesting: Chapters 6 and 7 present work on understanding
the exact semantics of the synchronization mechanism of transactional memory. In particu-
lar, Chapter 6 presents the transactional computation framework and Chapter 7 presents the
description of the exact semantics of open-nested transactions.
* Ownership-Aware Transactions: Chapter 8 presents ownership-aware transactions, which
allow programmers to get the concurrency of open-nesting without its headaches. Therefore,
this work can be used as a basis of a TM concurrency platform that provides high concurrency
while also providing understandable semantic guarantees.
* Nested Parallelism in Transactions: Chapter 9 presents the first TM design that allows par-
allelism within transactions. This feature is important if TM is to be integrated into the
concurrency platforms that support dynamic multithreaded languages. Again, this work is
related to both scheduling and synchronization in concurrency platforms, since the design is
particularly suited for concurrency platforms that use work-stealing schedulers.
If multicores follow Moore's Law, the number of cores per chip will double every 18 months.
The primary challenge that multicores face is the difficulty of parallel programming. Concurrency
platforms can ease parallel programming. However, in order to gain wide acceptance, they must
also provide performance competitive with hand-tuned programs. This dissertation aims to provide
strong theoretical underpinnings for the design of such concurrency platforms.
Future Work
There are many aspects of parallel programming on multicores, such as locality and heterogeneity,
that this dissertation does not consider. Both sequential computers and multicores have caches and
accessing data from the local cache is much cheaper than accessing data from main memory. For
sequential algorithms, researchers have developed both cache-aware [AV88, ACS87, BM72] and
cache-oblivious [FLPR99] strategies to minimize cache misses. However, this problem is even
more crucial for multicores since the memory latency may be even higher in multicores than serial
computers. There has been some recent work in understanding the cache complexity of parallel
algorithms [Val08, BCG+08]. However, it is just a drop in the bucket. In particular, we do not yet
completely understand how scheduling decisions taken by concurrency platforms impact the cache
performance.
In contrast to cache locality, which affects both sequential and parallel machine performance,
heterogeneity is an issue that primarily affects parallel machines. Multicores may exhibit het-
erogeneity at many levels. For example, they may exhibit heterogeneous communication: some
processors may be physically closer, and may therefore communicate faster with each other than
with others. How a program exploits this processor locality may be an important factor in its per-
formance. Multicores may also exhibit processor heterogeneity: different cores may have different
characteristics. Ideally, concurrency platforms should hide this heterogeneity from the programmer
and provide good performance by handling heterogeneity itself.
Appendices
.1 ON Model and Sequential Consistency
The transactional computation framework described in Section 6.2 admits only an a posteriori
analysis of a program execution. In this section, we describe how to extend the framework to
dynamically model a program execution. This dynamic model is used in Chapters 7, 8, and 9 in
order to model the particular design we model in those chapters. This chapter describes the general
modeling. We abstractly model the behavior of a generic transactional memory implementation as
a nondeterministic state machine which transitions between states as the program executes instruc-
tions.
.2 The OAT Model and Sequential Consistency
This appendix contains the details of the proof of Theorem 8.7: if the OAT model generates a trace
(C, Φ) and a topological sort order S, then S satisfies Definition 12, i.e., S is sequentially consistent
with respect to Φ.
In this appendix, we first define some useful notation for the proof. Next, we prove that the
OAT model preserves several invariants about memory operations, read set, and write sets. Finally,
we use these invariants to prove Theorem 8.7.
.2.1 Notation
We define some notation that is useful later for stating operational invariants of the OAT model.
For any subset S of nodes in the computation tree C, i.e., S ⊆ nodes(C), define
* low(S) = {X ∈ S : pDesc(X) ∩ S = ∅}.
* high(S) = {X ∈ S : pAnces(X) ∩ S = ∅}.
Intuitively, low(S) represents the nodes in S closest to the leaves of the tree. Similarly, high(S)
represents the nodes in S closest to the root of the tree. In cases where the set S is guaranteed to
fall along one root-to-leaf path in the tree, we define lowest(S) as the only element X ∈ low(S).
Similarly, we define highest(S) as the only element in high(S).
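The definitions of low(S) and high(S) can be illustrated with a short Python sketch over a tree stored as a child-to-parent dictionary. The representation and the helper name proper_ancestors are assumptions made for this example; pAnces and pDesc in the text are taken to mean proper ancestors and proper descendants.

```python
def proper_ancestors(parent, x):
    """All proper ancestors of x, following child -> parent links."""
    out = set()
    while x in parent:
        x = parent[x]
        out.add(x)
    return out


def high(parent, s):
    """Nodes of s with no proper ancestor in s (closest to the root)."""
    return {x for x in s if not (proper_ancestors(parent, x) & s)}


def low(parent, s):
    """Nodes of s with no proper descendant in s (closest to the leaves)."""
    # x has a proper descendant in s exactly when x is a proper
    # ancestor of some member of s.
    covered = set()
    for y in s:
        covered |= proper_ancestors(parent, y)
    return {x for x in s if x not in covered}
```

When S lies on a single root-to-leaf path, each of these sets has exactly one element, which is what lowest(S) and highest(S) name.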
We also define two time-dependent sets of transactions.
* The reader set, readers(t)(ℓ) = {T ∈ activeX(t)(C) : ℓ ∈ R(t, T)}.
* The writer set, writers(t)(ℓ) = {T ∈ activeX(t)(C) : ℓ ∈ W(t, T)}.
Said differently, readers(t)(ℓ) is the set of active transactions at time t which have location ℓ in
their read set. Similarly, writers(t)(ℓ) is the set of active transactions at time t which have location ℓ in their write set.
Next, we generalize the content sets from Definition 23 and define a set of dynamic content
sets.
Definition 38 At any time t, for any transaction T ∈ xactions(t, C) and a memory operation u ∈
memOps(t, C), define the sets cContent(t, T), oContent(t)(T), aContent(t, T), and vContent(t, T)
according to the ContentType(t, u, T) procedure:
ContentType(t, u, T) ▷ For any u ∈ memOps(t, T)
1 X ← xparent[u]
2 while (X ≠ T)
3 if X ∈ activeX(t)(C), return u ∈ vContent(t, T)
4 if X ∈ aborted(t, C), return u ∈ aContent(t, T)
5 if (X = committer(u)) return u ∈ oContent(t)(T)
6 X ← xparent[X]
7 return u ∈ cContent(t, T)
The difference between Definition 38 and the previous statement in Definition 23 is that for dy-
namic content sets, if we encounter a PENDING or PENDINGABORT transaction when walking
up the tree from a memory operation u to a transaction T, we place u in the active content of
T, i.e., u ∈ vContent(t, T). If a transaction T completes at time t*, it is not hard to see that
the dynamic classification ContentType(t, u, T) gives the same answer as the static classification
ContentType(u, T) for all times t > t*.
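The dynamic classification can be mirrored by a small Python function. This is a sketch under simplifying assumptions: the state at time t is passed in as plain dictionaries, a single "ACTIVE" status stands in for both PENDING and PENDINGABORT, u is assumed to lie in T's subtree, and all names are hypothetical.

```python
def content_type(u, T, xparent, status, committer):
    """Walk from memory operation u up to transaction T, as in ContentType.

    xparent:   dict mapping a node to its transactional parent.
    status:    dict mapping a transaction to "ACTIVE", "ABORTED", or
               "COMMITTED" at the time of the query.
    committer: dict mapping a memory operation to its committer, if any.
    """
    X = xparent[u]
    while X != T:
        if status[X] == "ACTIVE":
            return "vContent"      # still pending with respect to T
        if status[X] == "ABORTED":
            return "aContent"      # aborted with respect to T
        if X == committer.get(u):
            return "oContent"      # committed in an open-nested manner
        X = xparent[X]
    return "cContent"              # every transaction on the path committed
```

The first non-committed or committing transaction met on the way up decides the answer, so the order of the three checks inside the loop matters.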
.2.2 OAT Model Invariants
Because the OAT model performs eager conflict detection according to Definition 21, it is not hard
to prove the following invariant about the readers and writers to a particular memory location ℓ.
Theorem .1 At all times t, for all memory locations ℓ ∈ M, the OAT model maintains the following
invariants on the sets readers(t)(ℓ) and writers(t)(ℓ):
1. For all ℓ ∈ M, |low(writers(t)(ℓ))| = 1, i.e., lowest(writers(t)(ℓ)) exists.
2. For any T ∈ readers(t)(ℓ), either lowest(writers(t)(ℓ)) ∈ desc(T) or T ∈ desc(lowest(writers(t)(ℓ))).
PROOF. The proof is by induction on the instructions that the OAT model issues.
In the base case, for all locations ℓ ∈ M, we begin with readers(0)(ℓ) = writers(0)(ℓ) =
{root(C)}, and no other nodes in the computation tree C except root(C). Thus, Invariants 1 and 2
are satisfied.
In the inductive step, suppose at time t - 1, Invariants 1 and 2 are satisfied. A read or write
instruction at time t can not break the invariants without causing a conflict according to Defini-
tion 21. Therefore, successful read and write operations preserve the invariant. An unsuccess-
ful read or write operation can only trigger the sigabort of transactions, which does not
affect either invariant.
An xend instruction that commits a transaction T can only add the transaction xparent[T] to
readers(ℓ) or writers(ℓ). Since xparent[T] is an ancestor of T, it can not break either of the
two invariants.
The remaining instructions preserve Invariants 1 and 2 trivially. A fork or join instruction
at time t preserves the invariants because it does not change the set of active transactions or any
transaction read sets or write sets. An xbegin preserves the invariants because it creates new
transactions T with empty read sets and write sets. The xabort instruction preserves the invariants
because it can only remove transactions from readers(t)(ℓ) or writers(t)(ℓ). □
The following invariant shows that, informally, the read sets of transactions act as caches for
pairs (f, u) stored in write sets.
Lemma .2 At any time t, for any T ∈ readers(t)(ℓ), suppose (ℓ, u) ∈ R(t, T). Let T' =
lowest(xAnces(T) ∩ writers(t)(ℓ)). Then (ℓ, u) ∈ W(t, T').
PROOF. The proof is by induction on the instructions issued by the OAT model. In the base
case, we know for all memory locations ℓ ∈ M, we start with readers(0)(ℓ) = writers(0)(ℓ) =
{root(C)} and R(root(C)) = W(root(C)). Since T' = T = root(C), Lemma .2 is satisfied in the
base case.
For the inductive step, assume the lemma is satisfied at time t - 1. We show after any S-node
X issues an instruction at time t, the lemma is still satisfied.
For any T ∈ xactions(t - 1, C), after a fork, join, or xbegin instruction in step t, we have
R(t, T) = R(t - 1, T) and W(t, T) = W(t - 1, T). Thus, the lemma is satisfied after these instructions.
An xbegin which creates a new transaction X at time step t starts with R(t, X) = W(t, X) = ∅;
thus, the lemma is satisfied.
Next, consider an xabort issued by X ∈ xactions(t - 1, C). Suppose, before the xabort
of X there exists a transaction T ∈ readers(t-1)(ℓ) with (ℓ, u) ∈ R(t - 1, T). Let T' =
lowest(xAnces(T) ∩ writers(t-1)(ℓ)). Then before the xabort, (ℓ, u) ∈ W(t - 1, T'). Assume for
contradiction after the xabort of X, that there exists some transaction T ∈ xactions(t, C) such
that the invariant no longer holds for T, i.e., we no longer have (ℓ, u) ∈ W(t, T'). Since an xabort
does not change the contents of any transaction's write set, but removes X from writers(ℓ), the
only way to violate the invariant is if X = T'. Consider two cases: either X = T' = T, or
X = T' ≠ T. In the first case, we can not violate the invariant for T because T is aborted and
removed from readers(ℓ). In the second case, we must have T ∈ pDesc(X). But then, before the
xabort, we have T ∈ pDesc(X) ∩ activeN(t - 1, C) and X ∈ ready(t - 1, C), contradicting
the property that the ready nodes are the leaves of the tree of active nodes. Thus, the xabort must
preserve the invariant.
A successful read operation v observes the value from the closest transactional ancestor X
which has location f in its read set. The only transaction whose read set changes is xparent[v].
The invariant is preserved because xAnces(xparent[v]) ⊇ xAnces(X), and since the read does
not change any write sets.
A successful write operation v only changes the write set of xparent[v]; this write can not
break the invariant without generating a conflict.
Finally, suppose at time t, a ready node X issues an xend. Consider two cases:
1. X ≠ owner(ℓ). The only transaction Y which has its read set or write set change after the
xend (i.e., for which we could have R(t, Y) ≠ R(t - 1, Y) or W(t, Y) ≠ W(t - 1, Y)) is
Y = xparent[X]. The xend merges X's read and write sets into Y's read and write sets,
respectively; using Theorem .1, it is straightforward to show that the invariant is preserved
for Y.
For all other transactions T ∈ readers(t)(ℓ) with T ≠ Y, since the read set or write set of T
or any transaction in xAnces(T) remains the same, the invariant is still preserved for T.
2. Suppose X = owner(ℓ). Then, the only transaction whose read set or write set can change
is Y = root(C). But the only way to break the invariant is if X commits a pair (ℓ, v) from
W(t - 1, X) to root(C), which corrupts the version (ℓ, u) ∈ R(t - 1, T), for some transaction
T parallel to X. But then, we would violate Theorem .1, and should have had a conflict
earlier.
Since all possible choices for action k + 1 preserve the invariant, the lemma holds by induction.
□
Theorem .3 characterizes when a transaction should have a location in its write set.
Theorem .3 At any time t, consider any transaction T ∈ activeX(t)(C) and any memory location
ℓ such that xid(owner(ℓ)) < MT. Let Sℓ(t) = { u ∈ memOps(t, C) : W(u, ℓ) }. Exactly one of the
following cases holds:
1. T = root(C), (ℓ, ⊥) ∈ W(t, T), and two conditions are satisfied:
(a) cContent(t, T) ∩ Sℓ(t) = ∅.
(b) For all v ∈ Sℓ(t), we have v ∈ aContent(t, T) ∪ vContent(t, T).
2. There exists an (ℓ, u) ∈ W(t, T) which happens at time tu, and two conditions are satisfied:
(a) u ∈ cContent(t, T) ∩ Sℓ(t).
(b) For any operation v ∈ (Sℓ(t) − {u}) which happens at time tv, where tu < tv < t, we
have v ∈ aContent(t, T) ∪ vContent(t, T).
3. We have ℓ ∉ W(t, T), and cContent(t, T) ∩ Sℓ(t) = ∅.
PROOF.
This theorem can be proved by a straightforward, albeit tedious, induction on time.
Note that because we assume xid(owner(ℓ)) < MT, Sℓ(t) ∩ oContent(t)(T) = ∅, i.e., the
theorem is only concerned with memory locations ℓ which belong to T's open content. Because of
the properties of ownership and Xmodules, any location ℓ with xid(owner(ℓ)) > MT can never
propagate into T's write set anyway. □
The intuition for Theorem .3 lies mostly in Case 2; if at time t a pair (ℓ, u) is in the write set of
a transaction T, then u is the last write to ℓ in T's subtree which is "committed with respect to"
T. Any v which writes to ℓ after tu (the time u occurs) must belong to T's subtree; otherwise,
there would have been a conflict. Furthermore, any v which happens after tu must still be aborted
or pending with respect to T (i.e., v ∈ aContent(t, T) ∪ vContent(t, T)); otherwise, v should
replace u in T's write set.
Case 3 says the write set of T does not contain a location ℓ if no memory operation in T's
subtree commits ℓ to T. Case 1 of Theorem .3 handles the special case of the root.
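The intuition behind Case 2 and Case 3 can be illustrated with a small sketch. The code below is not the OAT model: it assumes plain closed nesting, ignores ownership, open-nested commits, and conflict detection, and the class Txn and its methods are hypothetical names introduced only for this illustration. It shows how a write that is pending with respect to T replaces the entry in T's write set only once it commits into T.

```python
class Txn:
    """Toy transaction: a parent pointer plus a write set (location -> last committed write)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.write_set = {}

    def write(self, loc, op):
        # A write performed by an operation directly inside this transaction.
        self.write_set[loc] = op

    def xend(self):
        # Commit into the parent: the child's entries override the parent's entries.
        if self.parent is None:
            raise ValueError("the root never issues an xend in this sketch")
        self.parent.write_set.update(self.write_set)
        self.write_set = {}

    def xabort(self):
        # Abort: discard the write set; nothing propagates to the parent.
        self.write_set = {}


root = Txn("root")
T = Txn("T", parent=root)
X = Txn("X", parent=T)

T.write("l", "u")                 # u is committed with respect to T
X.write("l", "v")                 # v is still pending with respect to T
pending_view = dict(T.write_set)  # T still records u for location l
X.xend()                          # v becomes committed with respect to T and replaces u
committed_view = dict(T.write_set)
```

In this toy setting, a location that no operation in T's subtree has committed to T simply never appears in T.write_set, which mirrors Case 3.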
.2.3 Proof of Sequential Consistency
Finally, we can use the invariants from Lemma .2 and Theorem .3 to prove Theorem 8.7.
PROOF. [Theorem 8.7]
The first and second conditions are true by construction, since the OAT model can
only set Φ(v) = u if u <S v, W(u, ℓ) and R(v, ℓ) ∨ W(v, ℓ).
To check the third and fourth conditions, we require some setup. Suppose at time tv, memory
operation v happens and the OAT model sets Φ(v) = u. Let A = lowest(readers(t)(ℓ) ∩
ances(v)). Because the OAT model sets Φ(v) = u, we must have (ℓ, u) ∈ R(t, A). Let T =
lowest(xAnces(A) ∩ writers(t)(ℓ)). By Lemma .2, we know (ℓ, u) ∈ W(t, T). By Theorem .3,
since (ℓ, u) ∈ W(t, T), we know u ∈ cContent(t, T). Let X = xLCA(u, v). We must have
T ∈ ances(X); otherwise, we could not have {u, v} ⊆ memOps(t, T).
Since u ∈ cContent(t, T), we know u ∈ cContent(t, X) ∪ oContent(t)(X). Therefore, we have
¬(u H v), satisfying the third condition.
To check the fourth condition, assume for contradiction that there exists a w such that W(w, ℓ),
and u <S w <S v. Let tv be the time that v happens. Then, since Φ(v) = u, we know u ∈ W(tv, T).
Therefore, by Theorem .3 we know w ∈ memOps(tv, T) and w ∈ aContent(tv, T) ∪ vContent(tv, T).
Let Y = xLCA(w, v). Since w ∈ memOps(tv, T), we know T ∈ ances(Y). Consider the two
cases for w:
1. Suppose w ∈ aContent(tv, T). Since T ∈ ances(Y), we know w ∈ cContent(tv, Y) ∪
aContent(tv, Y).
We can show by contradiction that we must have w ∈ aContent(tv, Y). If Y = T, then
we already know w ∈ aContent(tv, Y). Otherwise, assume T ∈ pAnces(Y). If we had
w ∈ cContent(tv, Y), then by Theorem .3, we must have (ℓ, y) ∈ W(t, Y). This statement
contradicts the fact that the OAT model found (ℓ, u) from transaction T, since a closer transaction
Y had ℓ in its read set.
But then, since w ∈ aContent(tv, Y), we have w H v.
2. Suppose w ∈ vContent(tv, T).
Then, we know w ∈ cContent(tv, Y) ∪ vContent(tv, Y). As in the previous case, we can
show w ∉ cContent(tv, Y).
If w ∈ vContent(tv, Y), then there exists some transaction Z ∈ activeX(t)(Y) − {Y} such
that ℓ ∈ W(tv, Z).
Since w ∈ memOps(t, Z), we can strengthen this condition to Z ∈ activeX(tv)(LCA(w, v)) −
{LCA(w, v)}. This statement leads to a contradiction, however, because w ∈ W(t, Z) must
conflict with v.
More formally, by Invariant 2 of Theorem .1, any new read operation v at time tv must satisfy
v ∈ desc(low(writers(tv)(ℓ))) (i.e., v is a descendant of the base of the spine for ℓ). At
time tv, however, we must have low(writers(tv)(ℓ)) ∈ desc(Z).
□
.3 Rules for Type Checking the OAT Type System
This appendix contains the type rules for the OAT type system. The grammar for the type system
is shown below.
P       ::= defn* e
defn    ::= class ocn(formal+) extends oc
            where constr* { field* meth* } |
            class xcn(formal+) extends xc
            where constr* { xfield* meth* }
oc      ::= ocn(owner+) | Object(owner)
xc      ::= xcn(owner+) | Xmodule(owner)
owner   ::= world[i] | formal | this[i]
constr  ::= (owner > owner) | (owner ≯ owner) |
            (owner = owner) | (owner ≠ owner)
meth    ::= t mn(formal*)(arg*) where constr* { e }
field   ::= t fd
xfield  ::= c fd
arg     ::= t x
t       ::= c | int
formal  ::= f
e       ::= new c | x | x = e |
            let (arg = e) in { e } |
            x.fd | x.fd = y | x.mn(owner*)(y*)

ocn   ∈ class names that are not subtypes of Xmodule
xcn   ∈ class names that are subtypes of Xmodule
fd    ∈ field names
mn    ∈ method names
x, y  ∈ variable names
f     ∈ owner names
i, j  ∈ int literals
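As a reading aid, part of the grammar can be transcribed into data types. The sketch below is only one possible encoding: every class name (World, Formal, This, ClassType, Field, ClassDefn) and the example class Stack are hypothetical, and methods, constraints, and expressions are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class World:            # owner ::= world[i]
    index: int


@dataclass(frozen=True)
class Formal:           # owner ::= formal
    name: str


@dataclass(frozen=True)
class This:             # owner ::= this[i]
    index: int


Owner = Union[World, Formal, This]


@dataclass(frozen=True)
class ClassType:        # oc ::= ocn(owner+) | Object(owner);  xc ::= xcn(owner+) | Xmodule(owner)
    name: str
    owners: Tuple[Owner, ...]


@dataclass(frozen=True)
class Field:            # field ::= t fd, where t is a ClassType or the string "int"
    ftype: Union[ClassType, str]
    name: str


@dataclass(frozen=True)
class ClassDefn:        # defn ::= class cn(formal+) extends c ... { field* ... }
    name: str
    formals: Tuple[str, ...]
    parent: ClassType
    is_xmodule: bool    # True for xcn definitions, False for ocn definitions
    fields: Tuple[Field, ...] = ()


# Hypothetical example: class Stack(f1) extends Object(f1) { int size }
stack = ClassDefn(
    name="Stack",
    formals=("f1",),
    parent=ClassType("Object", (Formal("f1"),)),
    is_xmodule=False,
    fields=(Field("int", "size"),),
)
```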
We define a number of predicates used in the type system. These predicates are adapted from
[BLS03], but our type system does not handle inner classes for now.
Predicate             Meaning
WFClasses(P)          There are no cycles in the class hierarchy.
ClassOnce(P)          No class is declared twice in P.
FieldsOnce(P)         No class contains two fields, declared or inherited, with the same name.
MethodsOnce(P)        No class contains two methods with the same name.
OverridesOK(P)        Overriding methods have the same return type and parameter types as the
                      methods being overridden.
WorldInMainOnly(P)    Only the main method uses the world tag to initialize owner.
ThisInXcOnly(P)       Only classes that are subtypes of Xmodule use the this tag to initialize owner.
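Two of these predicates are easy to state operationally. The following sketch assumes a program is summarized as a list of (class name, parent name) pairs; the function names class_once and wf_classes and the sample programs are hypothetical, and the sketch is not taken from [BLS03] or from the OAT implementation.

```python
def class_once(defns):
    """ClassOnce: no class name is declared twice."""
    names = [name for name, _ in defns]
    return len(names) == len(set(names))


def wf_classes(defns):
    """WFClasses: following 'extends' links from any class never revisits a class."""
    parent = dict(defns)
    for start in parent:
        seen = set()
        cur = start
        while cur in parent:      # stop at built-in roots such as Object or Xmodule
            if cur in seen:
                return False      # found a cycle in the hierarchy
            seen.add(cur)
            cur = parent[cur]
    return True


good = [("Node", "Object"), ("BigNode", "Node"), ("Table", "Xmodule")]
cyclic = [("A", "B"), ("B", "A")]
duplicated = [("Node", "Object"), ("Node", "Xmodule")]
```

With these inputs, good satisfies both predicates, cyclic fails wf_classes, and duplicated fails class_once.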
Our typing judgment follows the form adapted from [BLS03]: P; E ⊢ e : t, where P is the
program being checked, which provides information about class definitions; E is an environment
providing type information for the free variables in e; and t is the type of e.
The typing environment is defined as
E ::= ∅ | E, t x | E, owner f | E, constr
The typing environment contains the declared types of variables, the declared owner
parameters, the declared constraints among owners, and certain inferred constraints, such as
this[i] = this[j] when they are used in an Xmodule class definition.
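One way to picture the environment E is as an ordered, immutable sequence of entries. The sketch below makes that assumption explicit: the tuple encoding, the function names, and the freshness check on variable and owner names are all choices made for this illustration and are not part of the type system as stated.

```python
EMPTY = ()  # E ::= ∅


def dom(env):
    """Names bound in env: variables from 't x' entries and owners from 'owner f' entries."""
    names = set()
    for entry in env:
        if entry[0] == "var":
            names.add(entry[2])
        elif entry[0] == "owner":
            names.add(entry[1])
    return names


def extend_var(env, t, x):
    # E, t x  (assumed here to require that x is fresh)
    if x in dom(env):
        raise ValueError(f"{x} is already bound")
    return env + (("var", t, x),)


def extend_owner(env, f):
    # E, owner f  (assumed here to require that f is fresh)
    if f in dom(env):
        raise ValueError(f"{f} is already bound")
    return env + (("owner", f),)


def extend_constr(env, constr):
    # E, constr  (the constraint is stored as an opaque value)
    return env + (("constr", constr),)


def lookup_var(env, x):
    """Return the declared type of x, or None if x is not bound."""
    for entry in env:
        if entry[0] == "var" and entry[2] == x:
            return entry[1]
    return None


env = extend_owner(EMPTY, "f1")
env = extend_var(env, "int", "x")
env = extend_constr(env, ("f1", ">", "world"))
```

Because each extension returns a new tuple, an environment built for a method body never mutates the environment of the enclosing class.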
The typing system uses the following judgments.
Judgment              Meaning
⊢ P : t               program P yields type t
P ⊢ defn              defn is a well-formed class
P; E ⊢ constr         constraint constr is satisfied
P; E ⊢ (o1 = o2)      o1 and o2 represent the same owner instance
P; E ⊢owner o         o is an owner
P; E ⊢ wf             typing environment E is well-formed
P; E ⊢ t              t is a well-formed type
P; E ⊢ t1 <: t2       t1 is a subtype of t2
P; E ⊢ t1 <:= t2      t2 is assignable to t1
P ⊢ xfield ∈ xc       Xmodule class xc declares/inherits xfield
P ⊢ field ∈ oc        non-Xmodule class oc declares/inherits field
P; E ⊢ field          field is a well-formed field
P ⊢ meth ∈ xc         Xmodule class xc declares/inherits meth
P ⊢ meth ∈ oc         non-Xmodule class oc declares/inherits meth
P; E ⊢ meth           meth is a well-formed method
P; E ⊢ e : t          expression e has type t
We present the type rules for these judgments in the following pages.
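⊢ P : t

[PROG]
WFClasses(P)    ClassOnce(P)    FieldsOnce(P)    MethodsOnce(P)
OverridesOK(P)    WorldInMainOnly(P)    ThisInXcOnly(P)
P = defn1..n e    P ⊢ defni    P; ∅ ⊢ e : t
----------------------------------------------------------------
⊢ P : t

P ⊢ defn

[CLASS]
E = ocn(f1..n) this, owner f1..n, f1 > fi, constr*
P; E ⊢ wf    P; E ⊢ oc'    P; E ⊢ fieldi    P; E ⊢ methi
----------------------------------------------------------------
P ⊢ class ocn(f1..n) extends oc' where constr* { field* meth* }

[XMODULE CLASS]
E = xcn(f1..n) this, owner f1..n, f1 > fi, constr*
P; E ⊢ wf    P; E ⊢ xc'    P; E ⊢ fieldi    P; E ⊢ methi
----------------------------------------------------------------
P ⊢ class xcn(f1..n) extends xc' where constr* { xfield* meth* }

P; E ⊢ constr

[CONSTR ENV]
E = E1, constr, E2
----------------------------------------------------------------
P; E ⊢ constr

[> WORLD]
P; E ⊢owner o
----------------------------------------------------------------
P; E ⊢ (o > world)

[> OWNER]
P; E ⊢ e : cn(o1..n)
----------------------------------------------------------------
P; E ⊢ (e > o1)

[> REFL]
P; E ⊢owner o
----------------------------------------------------------------
P; E ⊢ (o > o)

[> TRANS]
P; E ⊢ (o1 > o2)    P; E ⊢ (o2 > o3)
----------------------------------------------------------------
P; E ⊢ (o1 > o3)

P; E ⊢ (o1 = o2)

[= OWNER]
E = E1, xc this, E2
----------------------------------------------------------------
P; E ⊢ (this[i] = this[j])

[= REFL]
P; E ⊢owner o
----------------------------------------------------------------
P; E ⊢ (o = o)

[= TRANS]
P; E ⊢ (o1 = o2)    P; E ⊢ (o2 = o3)
----------------------------------------------------------------
P; E ⊢ (o1 = o3)

P; E ⊢owner o

[OWNER WORLD]
----------------------------------------------------------------
P; E ⊢owner world

[OWNER FORMAL]
E = E1, owner f, E2
----------------------------------------------------------------
P; E ⊢owner f

[OWNER THIS]
E = E1, xc this, E2
----------------------------------------------------------------
P; E ⊢owner this[i]

P; E ⊢ wf

[ENV ∅]
----------------------------------------------------------------
P; ∅ ⊢ wf

[ENV X]
P; E ⊢ t    x ∉ Dom(E)    P; E ⊢ wf
----------------------------------------------------------------
P; E, t x ⊢ wf

[ENV OWNER]
f ∉ Dom(E)    P; E ⊢ wf
----------------------------------------------------------------
P; E, owner f ⊢ wf

[ENV CONSTR]
constr = (o > o') ∨ constr = …
P; E ⊢ wf    P; E ⊢owner o, o'
… x, y (P; E' ⊢ x > y) ∧ (P; E' ⊢ …
----------------------------------------------------------------
P; E, constr ⊢ wf

P; E ⊢ t

[TYPE INT]
----------------------------------------------------------------
P; E ⊢ int

[TYPE OBJECT]
P; E ⊢owner o
----------------------------------------------------------------
P; E ⊢ Object(o)

[TYPE XMODULE]
P; E ⊢owner o
----------------------------------------------------------------
P; E ⊢ Xmodule(o)

[TYPE OC]
P ⊢ class ocn(f1..n) ... where constr* ...
P; E ⊢owner oi    P; E ⊢ o1 > oi    P; E ⊢ constr [o1/f1]..[on/fn]
----------------------------------------------------------------
P; E ⊢ ocn(o1..n)

[TYPE XC]
P ⊢ class xcn(f1..n) ... where constr* ...
P; E ⊢owner oi    P; E ⊢ o1 > oi    P; E ⊢ constr [o1/f1]..[on/fn]
----------------------------------------------------------------
P; E ⊢ xcn(o1..n)

P; E ⊢ t1 <: t2

[SUBTYPE REFL]
P; E ⊢ t
----------------------------------------------------------------
P; E ⊢ t <: t

[SUBTYPE TRANS]
P; E ⊢ t1 <: t2    P; E ⊢ t2 <: t3
----------------------------------------------------------------
P; E ⊢ t1 <: t3

[SUBTYPE OC]
P; E ⊢ ocn(o1..k..n)
P ⊢ class ocn(f1..k..n) extends ocn'(f1 o2..k) ...
----------------------------------------------------------------
P; E ⊢ ocn(o1..k..n) <: ocn'(f1 o2..k) [o1/f1]..[on/fn]

[SUBTYPE XC]
P; E ⊢ xcn(o1..k..n)
P ⊢ class xcn(f1..k..n) extends xcn'(f1 o2..k) ...
----------------------------------------------------------------
P; E ⊢ xcn(o1..k..n) <: xcn'(f1 o2..k) [o1/f1]..[on/fn]

P; E ⊢ t1 <:= t2

[ASSIGNABILITY REFL]
P; E ⊢ t
----------------------------------------------------------------
P; E ⊢ t <:= t

[ASSIGNABILITY TRANS]
P; E ⊢ t1 <:= t2    P; E ⊢ t2 <:= t3
----------------------------------------------------------------
P; E ⊢ t1 <:= t3

[ASSIGNABILITY FOR XC]
P; E ⊢ xcn(o1..n)    P; E ⊢ xcn(o'1..n)
P; E ⊢ (oi … o'i) i∈1..n    P; E ⊢ (oi > o'i) …
----------------------------------------------------------------
P; E ⊢ xcn(o1..n) <:= xcn(o'1..n)

[ASSIGNABILITY FOR OC]
P; E ⊢ ocn(o1..n)    P; E ⊢ ocn(o'1..n)
P; E ⊢ (oi … o'i) i∈1..n    P; E ⊢ (oi > o'i) …
----------------------------------------------------------------
P; E ⊢ ocn(o1..n) <:= ocn(o'1..n)

P ⊢ xfield ∈ xc

[XFIELD DECLARED]
P ⊢ class xcn(f1..n) ... { ... xfield ... }
----------------------------------------------------------------
P ⊢ xfield ∈ xcn(f1..n)

[XFIELD INHERITED]
P ⊢ xfield ∈ xcn(f1..n)
P ⊢ class xcn'(g1..m) extends xcn(o1..n) ...
----------------------------------------------------------------
P ⊢ xfield [o1/f1]..[on/fn] ∈ xcn'(g1..m)

P ⊢ field ∈ oc

[FIELD DECLARED]
P ⊢ class ocn(f1..n) ... { ... field ... }
----------------------------------------------------------------
P ⊢ field ∈ ocn(f1..n)

[FIELD INHERITED]
P ⊢ field ∈ ocn(f1..n)
P ⊢ class ocn'(g1..m) extends ocn(o1..n) ...
----------------------------------------------------------------
P ⊢ field [o1/f1]..[on/fn] ∈ ocn'(g1..m)

P; E ⊢ field

[FIELD]
P; E ⊢ t
----------------------------------------------------------------
P; E ⊢ t fd

P ⊢ meth ∈ xc

[METHOD DECLARED IN XC]
P ⊢ class xcn(f1..n) ... { ... meth ... }
----------------------------------------------------------------
P ⊢ meth ∈ xcn(f1..n)

[METHOD INHERITED BY XC]
P ⊢ meth ∈ xcn(f1..n)
P ⊢ class xcn'(g1..m) extends xcn(o1..n) ...
----------------------------------------------------------------
P ⊢ meth [o1/f1]..[on/fn] ∈ xcn'(g1..m)

P ⊢ meth ∈ oc

[METHOD DECLARED IN OC]
P ⊢ class ocn(f1..n) ... { ... meth ... }
----------------------------------------------------------------
P ⊢ meth ∈ ocn(f1..n)

[METHOD INHERITED BY OC]
P ⊢ meth ∈ ocn(f1..n)
P ⊢ class ocn'(g1..m) extends ocn(o1..n) ...
----------------------------------------------------------------
P ⊢ meth [o1/f1]..[on/fn] ∈ ocn'(g1..m)

P; E ⊢ meth

[METHOD]
E' = E, owner f1..n, constr*, arg*
P; E' ⊢ wf    P; E' ⊢ e : t
----------------------------------------------------------------
P; E ⊢ t mn(f1..n)(arg*) where constr* { e }

P; E ⊢ e : t

[EXP TYPE]
P; E ⊢ t    P; E ⊢ e : t
----------------------------------------------------------------
P; E ⊢ e : t

[EXP SUB]
P; E ⊢ e : t'    P; E ⊢ t' <: t
----------------------------------------------------------------
P; E ⊢ e : t

[EXP NEW]
P; E ⊢ c
----------------------------------------------------------------
P; E ⊢ new c : c

[EXP VAR]
E = E1, t x, E2
----------------------------------------------------------------
P; E ⊢ x : t

[EXP LET]
arg = t1 x    P; E ⊢ e1 : t'    P; E ⊢ t1 <:= t'
P; E, arg ⊢ e2 : t2
----------------------------------------------------------------
P; E ⊢ let (arg = e1) in { e2 } : t2

[EXP VAR ASSIGN]
P; E ⊢ x : t    P; E ⊢ e : t'
P; E ⊢ t <:= t'
----------------------------------------------------------------
P; E ⊢ x = e : t

[EXP REF]
P; E ⊢ x : cn(o1..n)
P ⊢ (t fd) ∈ cn(f1..n)
----------------------------------------------------------------
P; E ⊢ x.fd : t [o1/f1]..[on/fn]

[EXP REF ASSIGN]
P; E ⊢ x : cn(o1..n)    P ⊢ (t fd) ∈ cn(f1..n)
P; E ⊢ y : t' [o1/f1]..[on/fn]    P; E ⊢ t <:= t'
----------------------------------------------------------------
P; E ⊢ x.fd = y : t [o1/f1]..[on/fn]

[EXP INVOKE]
P ⊢ (t mn(f(n+1)..m)(tj yj j∈1..k) where constr* ...) ∈ cn(f1..n)
P; E ⊢ x : cn(o1..n)    P; E ⊢ xj : t'j [o1/f1]..[om/fm]
P; E ⊢ o1 > oi    P; E ⊢ constr [o1/f1]..[om/fm]    P; E ⊢ (tj <:= t'j) j∈1..k
----------------------------------------------------------------
P; E ⊢ x.mn(o(n+1)..m)(x1..k) : t [o1/f1]..[om/fm]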
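The subtyping rules of this appendix ([SUBTYPE REFL], [SUBTYPE TRANS], [SUBTYPE OC], and [SUBTYPE XC]) amount to walking up the extends chain while substituting actual owners for formal owner parameters. The sketch below illustrates that reading under simplifying assumptions: owners are plain strings, the substitution is applied simultaneously rather than as the sequence [o1/f1]..[on/fn], well-formedness of the types is not checked, and the class table CLASSES with the classes Node and BigNode is hypothetical.

```python
def substitute(owner_args, formals, actuals):
    """Replace each formal owner appearing in owner_args by the matching actual owner."""
    mapping = dict(zip(formals, actuals))
    return tuple(mapping.get(o, o) for o in owner_args)


def is_subtype(classes, sub, sup):
    """Check sub <: sup, where each type is a pair (class name, tuple of owners).

    classes maps a class name to (formals, parent name, parent owner arguments).
    The class hierarchy is assumed to be acyclic (WFClasses).
    """
    name, owners = sub[0], tuple(sub[1])
    target = (sup[0], tuple(sup[1]))
    while True:
        if (name, owners) == target:
            return True               # reflexivity, possibly after several extends steps
        if name not in classes:
            return False              # reached a root class without finding the target
        formals, parent_name, parent_args = classes[name]
        owners = substitute(parent_args, formals, owners)
        name = parent_name            # one extends step, chained by transitivity


# Hypothetical class table:
#   class Node(f1, f2) extends Object(f1)
#   class BigNode(g1, g2, g3) extends Node(g1, g3)
CLASSES = {
    "Node": (("f1", "f2"), "Object", ("f1",)),
    "BigNode": (("g1", "g2", "g3"), "Node", ("g1", "g3")),
}
```

For example, BigNode(a, b, c) is a subtype of Node(a, c) and of Object(a) in this table, but not of Node(a, b).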
Bibliography
[AAK+05] C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and
Sean Lie, Unbounded transactional memory, Proceedings of the International Sym-
posium on High-Performance Computer Architecture (HPCA) (San Francisco, Cali-
fornia), February 2005, pp. 316-327.
[ABB00] Umar A. Acar, Guy E. Blelloch, and Robert D. Blumofe, The data locality of work
stealing, Proceedings of the ACM Symposium on Parallel Algorithms and Architectures
(SPAA) (Bar Harbor, Maine), 2000, pp. 1-12.
[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton, Thread scheduling for
multiprogrammed multiprocessors, Proceedings of the ACM Symposium on Parallel
Algorithms and Architectures (SPAA) (Puerto Vallarta, Mexico), June 1998, pp. 119-
129.
[ACH+07] Eric Allen, David Chase, Joe Hallett, Victor Luchangco, Jan-Willem Maessen,
Sukyoung Ryu, Guy L. Steele Jr., and Sam Tobin-Hochstadt, The Fortress language
specification, version 1.0 β, Tech. report, Sun Microsystems, Inc., March 2007.
[ACS87] Alok Aggarwal, Ashok K. Chandra, and Marc Snir, Hierarchical memory with block
transfer, Proceedings of the Annual Symposium on Foundations of Computer Sci-
ence (FOCS) (Los Angeles, California), IEEE, 12-14 October 1987, pp. 204-216.
[AFS08] Kunal Agrawal, Jeremy T. Fineman, and Jim Sukha, Nested parallelism in transac-
tional memory, Proceedings of the ACM SIGPLAN Symposium on Principles and
Practice of Parallel Programming (PPoPP) (Salt Lake City, Utah, USA), February
2008.
[AHHL06] Kunal Agrawal, Yuxiong He, Wen Jing Hsu, and Charles E. Leiserson, Adaptive task
scheduling with parallelism feedback, Proceedings of the Annual ACM SIGPLAN
Symposium on Principles and Practice of Parallel Programming (PPoPP) (New York
City, NY, USA), March 2006.
[AHHL08] Kunal Agrawal, Yuxiong He, Wen Jing Hsu, and Charles E. Leiserson, Adaptive work
stealing with parallelism feedback, ACM Transactions on
Computer Systems 26 (2008), no. 3.
[AHL06] Kunal Agrawal, Yuxiong He, and Charles E. Leiserson, An empirical evaluation of
work stealing with parallelism feedback, Proceedings of the International Conference
on Distributed Computing Systems (ICDCS) (Lisboa, Portugal), July 2006.
[AHL07] Kunal Agrawal, Yuxiong He, and Charles E. Leiserson, Adaptive work stealing with
parallelism feedback, Proceedings of the Annual ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (PPoPP) (San Jose, CA), March 2007.
[AHS94] James Aspnes, Maurice Herlihy, and Nir Shavit, Counting networks, Journal of the
ACM 41 (1994), no. 5, 1020-1048.
[ALS06] Kunal Agrawal, Charles E. Leiserson, and Jim Sukha, Memory models for open-
nested transactions, Proceedings of the Workshop on Memory Systems Performance
and Correctness (MSPC), October 2006, In conjunction with International Confer-
ence on Architectural Support for Programming Languages and Operating Systems
(ASPLOS).
[ALS08] Kunal Agrawal, I-Ting Angelina Lee, and Jim Sukha, Safe open-nested transac-
tions through ownership, Tech. Report MIT-CSAIL-TR-2008-038, MIT CSAIL, June
2008, Available online at
http://supertech.csail.mit.edu/~angelee/safeTech.pdf.
[ALS09] Kunal Agrawal, I-Ting Angelina Lee, and Jim Sukha, Safe open-nested transactions
through ownership, Proceedings of the ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)
(Raleigh, NC, USA), February 2009.
[Art09] Cilk Arts, Cilk++ programmer's guide, http://www.cilk.com, April 2009.
[AV88] Alok Aggarwal and Jeffrey Scott Vitter, The input/output complexity of sorting and
related problems, Communications of the ACM 31 (1988), no. 9, 1116-1127.
[Bar93] Greg Barnes, A method for implementing lock-free shared data structures, Pro-
ceedings of the Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA) (Velen, Germany), June 1993, pp. 261-270.
[BBG89] Catriel Beeri, Philip A. Bernstein, and Nathan Goodman, A model for concurrency
in nested transactions systems, Journal of the ACM 36 (1989), no. 2, 230-269.
[BCD+02] Michael A. Bender, Richard Cole, Erik Demaine, Martin Farach-Colton, and Jack
Zito, Two simplified algorithms for maintaining order in a list, Proceedings of the
European Symposium on Algorithms (ESA) (Rome, Italy), 2002, pp. 152-164.
[BCG+08] Guy E. Blelloch, Rezaul A. Chowdhury, Phillip B. Gibbons, Vijaya Ramachandran,
Shimin Chen, and Michael Kozuch, Provably good multicore cache performance
for divide-and-conquer algorithms, Proceedings of the ACM-SIAM Symposium on
Discrete Algorithms (SODA) (Philadelphia, PA, USA), 2008, pp. 501-510.
[BCH+94] Guy E. Blelloch, Siddhartha Chatterjee, Jonathan C. Hardwick, Jay Sipelstein, and
Marco Zagha, Implementation of a portable nested data-parallel language, Journal
of Parallel and Distributed Computing 21 (1994), no. 1, 4-14.
[BDKS04] Nikhil Bansal, Kedar Dhamdhere, Jochen Konemann, and Amitabh Sinha, Non-
clairvoyant scheduling for minimizing mean slowdown, Algorithmica 40 (2004),
no. 4, 305-318.
[BEGCS74] John L. Bruno, Edward G. Coffman, Jr., and Ravi Sethi, Scheduling independent tasks
to reduce mean finishing time, Communications of the ACM 17 (1974), no. 7, 382-387.
[BFGL04] Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, and Charles E. Leiserson, On-
the-fly maintenance of series-parallel relationships in fork-join multithreaded pro-
grams, Proceedings of the Annual ACM Symposium on Parallel Algorithms and Ar-
chitectures (SPAA) (Barcelona, Spain), June 2004, pp. 133-144.
[BFJ+95] Robert D. Blumofe, Matteo Frigo, Christopher F. Joerg, Bradley C. Kuszmaul,
Charles E. Leiserson, Rob Miller, Keith H. Randall, and Yuli Zhou, Cilk 2.0 ref-
erence manual, Massachusetts Institute of Technology Laboratory for Computer Sci-
ence, 545 Technology Square, Cambridge, Massachusetts 02139, June 1995.
[BG96] Guy E. Blelloch and John Greiner, A provable time and space efficient imple-
mentation of NESL, International Conference on Functional Programming (ICFP)
(Philadelphia, Pennsylvania), 1996, pp. 213-225.
[BGM95] Guy E. Blelloch, Phillip B. Gibbons, and Yossi Matias, Provably efficient schedul-
ing for languages with fine-grained parallelism, Proceedings of the Annual ACM
Symposium on Parallel Algorithms and Architectures (SPAA) (Santa Barbara, Cali-
fornia), July 1995, pp. 1-12.
[BGM99] Guy Blelloch, Phil Gibbons, and Yossi Matias, Provably efficient scheduling for lan-
guages with fine-grained parallelism, Journal of the ACM 46 (1999), no. 2, 281-321.
[BJK+95] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiser-
son, Keith H. Randall, and Yuli Zhou, Cilk: An efficient multithreaded runtime sys-
tem, Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP) (Santa Barbara, California), July 1995, pp. 207-216.
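[BJK+96] Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson,
Keith H. Randall, and Yuli Zhou, Cilk: An efficient multithreaded runtime system, Journal of Parallel and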
Distributed Computing 37 (1996), no. 1, 55-69.
[BL97] Robert D. Blumofe and Philip A. Lisiecki, Adaptive and reliable parallel computing
on networks of workstations, Proceedings of the USENIX Annual Technical Confer-
ence on UNIX and Advanced Computing Systems (Anaheim, California), January
1997, pp. 133-147.
[BL98] Robert D. Blumofe and Charles E. Leiserson, Space-efficient scheduling of multi-
threaded computations, SIAM Journal on Computing 27 (1998), no. 1, 202-229.
[BL99] Robert D. Blumofe and Charles E. Leiserson, Scheduling multithreaded computations
by work stealing, Journal of the
ACM 46 (1999), no. 5, 720-748.
[BLM05] Colin Blundell, E. Christopher Lewis, and Milo M. K. Martin, Deconstructing
transactions: The subtleties of atomicity, Proceedings of the Annual Workshop on
Duplicating, Deconstructing, and Debunking (WDDD), Jun 2005.
[BLM06] Colin Blundell, E Christopher Lewis, and Milo M. K. Martin, Subtleties of transac-
tional memory atomicity semantics, Computer Architecture Letters 5 (2006), no. 2.
[BLS] Robert D. Blumofe, Charles E. Leiserson, and Bin Song, Automatic processor
allocation for work-stealing jobs, Unpublished manuscript.
[BLS03] Chandrasekhar Boyapati, Barbara Liskov, and Liuba Shrira, Ownership types for ob-
ject encapsulation, Proceedings of the ACM Symposium on Principles of Program-
ming Languages (POPL) (New Orleans, Louisiana), January 2003.
[Blu95] Robert D. Blumofe, Executing multithreaded programs efficiently, Ph.D. thesis,
Department of Electrical Engineering and Computer Science, Massachusetts Institute of
Technology, Cambridge, Massachusetts, September 1995, Available as MIT Laboratory
for Computer Science Technical Report MIT/LCS/TR-677.
[BM72] Rudolf Bayer and Edward M. McCreight, Organization and maintenance of large
ordered indexes, Acta Informatica 1 (1972), no. 3, 173-189.
[Boa08] OpenMP Architecture Review Board, OpenMP specification and features,
http://openmp.org/wp/, May 2008.
[BP94] Robert D. Blumofe and David S. Park, Scheduling large-scale parallel computations
on networks of workstations, Proceedings of the International Symposium on High
Performance Distributed Computing (HPDC) (San Francisco, California), August
1994, pp. 96-105.
[BP98a] Robert D. Blumofe and Dionisios Papadopoulos, The performance of work stealing
in multiprogrammed environments, Tech. Report TR-98-13, The University of Texas
at Austin, Department of Computer Sciences, May 1998.
[BP98b] Robert D. Blumofe and Dionisios Papadopoulos, The performance of work stealing
in multiprogrammed environments, SIGMETRICS, 1998, pp. 266-267.
[BP99] Robert D. Blumofe and Dionisios Papadopoulos, Hood: A user-level threads library
for multiprogrammed multiprocessors,
Technical Report, University of Texas at Austin, 1999.
[Bre74] Richard P. Brent, The parallel evaluation of general arithmetic expressions, Journal
of the ACM 21 (1974), no. 2, 201-206.
[BS81] F. Warren Burton and M. Ronan Sleep, Executing functional programs on a virtual
tree of processors, Proceedings of the Conference on Functional Programming
Languages and Computer Architecture (FPCA) (Portsmouth, New Hampshire), October
1981, pp. 187-194.
[CB01] Walfredo Cirne and Francine Berman, A model for moldable supercomputer jobs,
Proceedings of the International Symposium on Parallel and Distributed Systems
(IPDPS) (Washington, DC, USA), 2001, pp. 50-59.
[CCL00] Bradford L. Chamberlain, Sung-Eun Choi, E Christopher Lewis, Calvin Lin,
Lawrence Snyder, and W. Derrick Weathersby, ZPL: A machine independent pro-
gramming language for parallel computers, IEEE Transactions on Software Engi-
neering 26 (2000), no. 3, 197-211.
[CL05] David Chase and Yossi Lev, Dynamic circular work-stealing deque, Proceedings of
the ACM symposium on Parallelism in Algorithms and Architectures (SPAA) (New
York, NY, USA), 2005, pp. 21-28.
[CLRS01] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein, In-
troduction to algorithms, second ed., The MIT Press and McGraw-Hill, 2001.
[CMC+07] Brian D. Carlstrom, Austen McDonald, Michael Carbin, Christos Kozyrakis, and
Kunle Olukotun, Transactional collection classes, Proceedings of the ACM SIG-
PLAN Symposium on Principles and Practices of Parallel Programming (PPoPP)
(New York, NY, USA), 2007, pp. 56-67.
[Dav73] C.T. Davies, Recovery semantics for a DB/DC system, ACM National Conference,
1973, pp. 136-141.
[DD96] Xiaotie Deng and Patrick Dymond, On multiprocessor system scheduling,
Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (SPAA),
1996, pp. 82-88.
[DES99] DESMO-J: A framework for discrete-event modelling and simulation,
http://asi-www.informatik.uni-hamburg.de/desmoj/, 1999.
[DFL+06] Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and
Daniel Nussbaum, Hybrid transactional memory, Proceedings of International Con-
ference on Architectural Support for Programming Languages and Operating Sys-
tems (ASPLOS), October 2006.
[DGBL96] Xiaotie Deng, Nian Gu, Tim Brecht, and KaiCheng Lu, Preemptive scheduling of
parallel jobs on multiprocessors, Proceedings of the Symposium on Discrete Algo-
rithms (SODA), 1996, pp. 159-167.
[DLL05] John Danaher, I-Ting Angelina Lee, and Charles E. Leiserson, Exception handling in
JCilk, Synchronization and Concurrency in Object-Oriented Languages (SCOOL),
October 2005, Available at http://hdl.handle.net/1802/2095.
[Dow98] Allen B. Downey, A parallel workload model and its implications for processor al-
location, Cluster Computing 1 (1998), no. 1, 133-145.
[DS87] P. Dietz and D. Sleator, Two algorithms for maintaining order in a list, Proceedings
of the Annual ACM Symposium on Theory of Computing (STOC) (New York City),
May 1987, pp. 365-372.
[DS09] Luke Dalessandro and Michael Scott, Strong isolation is a weak idea, Proceedings of
the Workshop on Transactional Memory (TRANSACT) (Raleigh, North Carolina),
2009.
[EAS+95] Guy Edjlali, Gagan Agrawal, Alan Sussman, Jim Humphries, and Joel Saltz, Com-
piler and runtime support for programming in adaptive parallel environments, Tech.
Report CS-TR-3510, University of Maryland, 1995.
[EASS94] Guy Edjlali, Gagan Agrawal, Alan Sussman, and Joel Saltz, Data parallel program-
ming in an adaptive environment, Tech. Report CS-TR-3350, University of Mary-
land, 1994.
[ECBD03] Jeff Edmonds, Donald D. Chinn, Timothy Brecht, and Xiaotie Deng, Non-clairvoyant
multiprocessor scheduling of jobs with changing execution characteristics, Journal of
Scheduling 6 (2003), no. 3, 231-250.
[Edm99] Jeff Edmonds, Scheduling in the dark, Proceedings of Symposium on Theory of
Computing (STOC), 1999, pp. 179-188.
[ESS05] Kemal Ebcioglu, Vijay Saraswat, and Vivek Sarkar, X10: An experimental language
for high productivity programming of scalable systems, Proceedings of Workshop
on Productivity and Performance in High-End Computing (P-PHEC), 2005, In
conjunction with Symposium on High Performance Computer Architecture (HPCA).
[EZL89] Derek L. Eager, John Zahorjan, and Edward D. Lazowska, Speedup versus efficiency
in parallel systems, IEEE Transactions on Computers 38 (1989), no. 3, 408-423.
[Fei] Dror Feitelson, Parallel workloads archive, http://www.cs.huji.ac.il/labs/paral
[Fei96] Dror G. Feitelson, Packing schemes for gang scheduling, Proceedings of the Work-
shop on Job Scheduling Strategies on Parallel Processors (JSSPP) (Dror G. Feitelson
and Larry Rudolph, eds.), vol. 1162, Springer, 1996, pp. 89-110.
[Fei97] Dror G. Feitelson, Job scheduling in multiprogrammed parallel systems (extended version),
Tech. report, IBM Research Report RC 19790 (87657) 2nd Revision, 1997.
[Fin05] Jeremy T. Fineman, Provably good race detection that runs in parallel, Master's
thesis, Massachusetts Institute of Technology, Department of Electrical Engineering
and Computer Science, Cambridge, MA, August 2005.
[FL97] Mingdong Feng and Charles E. Leiserson, Efficient detection of determinacy races in
Cilk programs, Proceedings of the Annual ACM Symposium on Parallel Algorithms
and Architectures (SPAA) (Newport, Rhode Island), June 22-25 1997, pp. 1-11.
[FL98] Matteo Frigo and Victor Luchangco, Computation-centric memory models, Pro-
ceedings of the Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA) (Puerto Vallarta, Mexico), 1998, pp. 240-249.
[FLPR99] Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran,
Cache-oblivious algorithms, Proceedings of the Annual Symposium on Foundations
of Computer Science (FOCS) (New York, New York), October 17-19 1999, pp. 285-
297.
[FLR98] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall, The implementation of
the Cilk-5 multithreaded language, Proceedings of the ACM SIGPLAN Conference
on Programming Language Design and Implementation (PLDI) (Montreal, Canada),
1998.
[FM87] Raphael Finkel and Udi Manber, DIB - A distributed implementation of backtrack-
ing, ACM Transactions on Programming Languages and Systems 9 (1987), no. 2,
235-256.
[For93] High Performance Fortran Forum, High performance fortran language specification
version 1.0, Tech. report, Rice University, 1993.
[Fri98] Matteo Frigo, The weakest reasonable memory model, Master's thesis, Massachusetts
Institute of Technology Department of Electrical Engineering and Computer Science,
January 1998.
[FTYZ90] Zhixi Fang, Peiyi Tang, Pen-Chung Yew, and Chuan-Qi Zhu, Dynamic processor
self-scheduling for general parallel nested loops, IEEE Transactions on Computers
39 (1990), no. 7, 919-929.
[GJSB00] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha, The Java language
specification, second ed., Addison Wesley, 2000.
[GK08] Rachid Guerraoui and Michal Kapalka, On the correctness of transactional mem-
ory, Proceedings of the ACM SIGPLAN Symposium on Principles and Practices of
Parallel Programming (PPoPP) (New York, NY, USA), ACM, 2008, pp. 175-184.
[GR93] Jim Gray and Andreas Reuter, Transaction processing: Concepts and techniques,
Morgan Kaufmann, 1993.
[Gra69] R. L. Graham, Bounds on multiprocessing timing anomalies, SIAM Journal on Ap-
plied Mathematics 17 (1969), no. 2, 416-429.
[Gra81] Jim Gray, The transaction concept: Virtues and limitations, Proceedings of the
International Conference of Very Large Databases (VLDB), September 1981,
pp. 144-154.
[Gu95] Nian Gu, Competitive analysis of dynamic processor allocation strategies, Master's
thesis, York University, 1995.
[Hal84] Robert H. Halstead, Jr., Implementation of Multilisp: Lisp on a multiprocessor, Pro-
ceedings of the ACM Symposium on Lisp and Functional Programming (Austin,
Texas), August 1984, pp. 9-17.
[HB99] Mor Harchol-Balter, The effect of heavy-tailed job size distributions on computer
system design, Conference on Applications of Heavy Tailed Distributions in
Economics, 1999.
[HBD97] Mor Harchol-Balter and Allen B. Downey, Exploiting process lifetime distributions
for dynamic load balancing, ACM Transactions on Computer Systems 15 (1997),
no. 3, 253-285.
[HHL06] Yuxiong He, Wen Jing Hsu, and Charles E. Leiserson, Provably efficient two-level
adaptive scheduling, Proceedings of the Workshop on Job Scheduling Strategies for
Parallel Processing (JSSPP) (Saint-Malo, France), 2006.
[HHL07] Yuxiong He, Wen Jing Hsu, and Charles E. Leiserson, Provably efficient online
non-clairvoyant adaptive scheduling, Proceedings of the International Conference on
Parallel and Distributed Systems (IPDPS) (Long Beach, California, USA), March 2007.
[HK08] Maurice Herlihy and Eric Koskinen, Transactional boosting: a methodology for
highly-concurrent transactional objects, Proceedings of ACM SIGPLAN Sympo-
sium on Principles and Practices of Parallel Programming (PPoPP) (New York, NY,
USA), ACM, Feb 2008, pp. 207-216.
[HLM03] Maurice P. Herlihy, Victor Luchangco, and Mark Moir, Obstruction-free synchro-
nization: Double-ended queues as an example, Proceedings of the IEEE Interna-
tional Conference on Distributed Computing Systems (ICDCS) (Providence, Rhode
Island), May 2003, pp. 522-529.
[HLMS06] Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit, A dynamic-sized nonblocking
work stealing deque, Distributed Computing 18 (2006), no. 3, 189-207.
[HM] Maurice Herlihy and J. Eliot B. Moss, Transactional memory: Architectural support
for lock-free data structures, Proceedings of the International Conference on
Computer Architecture (ISCA) (San Diego, CA), pp. 289-300.
[HS91] S. F. Hummel and E. Schonberg, Low-overhead scheduling of nested parallelism,
IBM Journal of Research and Development 35 (1991), no. 5-6, 743-765.
[HS02] Danny Hendler and Nir Shavit, Non-blocking steal-half work queues, Proceedings of
the Annual Symposium on Principles of Distributed Computing (PODC) (New York,
NY, USA), ACM, 2002, pp. 280-289.
[HWC+04] Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis,
Ben Hertzberg, Manohar K. Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle
Olukotun, Transactional memory coherence and consistency, Proceedings of the An-
nual International Symposium on Computer Architecture (ISCA), 2004, pp. 102-
113.
[HZJ94] Michael Halbherr, Yuli Zhou, and Chris F. Joerg, MIMD-style parallel programming
with continuation-passing threads, Proceedings of the International Workshop on
Massive Parallelism: Hardware, Software, and Applications (Capri, Italy),
September 1994.
[Ins] Institute of Electrical and Electronic Engineers, Information technology - Portable
Operating System Interface (POSIX) - Part 1: System application program interface
(API) [C language], IEEE Standard 1003.1, 1996 Edition.
[IR94] Amos Israeli and Lihu Rappoport, Disjoint-access-parallel implementations of
strong shared memory primitives, Proceedings of the Annual ACM Symposium on
Principles of Distributed Computing (PODC) (New York, NY, USA), ACM, 1994,
pp. 151-160.
[KCJ+06] Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony
Nguyen, Hybrid transactional memory, Proceedings of the ACM SIGPLAN Sym-
posium on Principles and Practices of Parallel Programming (PPoPP) (New York,
NY), March 2006.
[KKP+81] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe, Dependence graphs
and compiler optimizations, Proceedings of the ACM SIGPLAN-SIGACT Sympo-
sium on Principles of Programming Languages (POPL) (New York, NY, USA), ACM
Press, 1981, pp. 207-218.
[KR03] Matthias Korch and Thomas Rauber, A comparison of task pools for dynamic load
balancing of irregular algorithms, Concurrency and Computation: Practice & Expe-
rience 16 (2003), no. 1, 1-47.
[KZ88] Richard M. Karp and Yanjun Zhang, A randomized parallel branch-and-bound pro-
cedure, Proceedings of the Twentieth Annual ACM Symposium on Theory of Com-
puting (Chicago, Illinois), May 1988, pp. 290-300.
[LAK03] Sung-Chae Lim, Joonseon Ahn, and Myoung Ho Kim, A concurrent B^link-tree algo-
rithm using a cooperative locking protocol, Lecture Notes in Computer Science, vol.
2712, Springer Berlin / Heidelberg, 2003, pp. 253-260.
[Lam79] Leslie Lamport, How to make a multiprocessor computer that correctly executes mul-
tiprocess programs, IEEE Transactions on Computers C-28 (1979), no. 9, 690-691.
[LF03] Uri Lublin and Dror G. Feitelson, The workload on parallel supercomputers: Mod-
eling the characteristics of rigid jobs, Journal of Parallel and Distributed Computing
63 (2003), no. 11, 1105-1122.
[Lib06] Ben Liblit, An operational semantics for LogTM, Tech. report, Department of Com-
puter Sciences, University of Wisconsin-Madison, August 2006.
[LLS07] Malcolm Yoke Hean Low, Weiguo Liu, and Bertil Schmidt, A parallel BSP algorithm
for irregular dynamic programming, Proceedings of the International Symposium on
Advanced Parallel Processing Technologies, 2007, pp. 151-160.
[LO86] Will Leland and Teunis J. Ott, Load-balancing heuristics and process behavior, SIG-
METRICS (New York, NY, USA), 1986, pp. 54-69.
[LV90] Scott T. Leutenegger and Mary K. Vernon, The performance of multiprogrammed
multiprocessor scheduling policies, Proceedings of the 1990 ACM SIGMETRICS
Conference on Measurement and Modeling of Computer Systems (Boulder, Col-
orado), May 1990.
[MBM+06] K.E. Moore, J. Bobba, M.J. Moravan, M.D. Hill, and D.A. Wood, LogTM: Log-
based transactional memory, Proceedings of the Symposium on High Performance
Computer Architecture (HPCA), Feb 2006.
[MBS+08] Vijay Menon, Steven Balensiefer, Tatiana Shpeisman, Ali-Reza Adl-Tabatabai,
Richard L. Hudson, Bratin Saha, and Adam Welc, Practical weak-atomicity seman-
tics for Java STM, Proceedings of the Annual Symposium on Parallelism in Algorithms
and Architectures (SPAA) (New York, NY, USA), ACM, 2008, pp. 314-325.
[MCC+06] Austen McDonald, JaeWoong Chung, Brian D. Carlstrom, Chi Cao Minh, Hassan
Chafi, Christos Kozyrakis, and Kunle Olukotun, Architectural semantics for practical
transactional memory, Proceedings of the International Symposium on Computer
Architecture (ISCA), June 2006.
[MCN+00] Xavier Martorell, Julita Corbalán, Dimitrios S. Nikolopoulos, Nacho Navarro, Eleft-
herios D. Polychronopoulos, Theodore S. Papatheodorou, and Jesús Labarta, A tool
to schedule parallel applications on multiprocessors: the NANOS CPU manager,
Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing
(JSSPP) (Cancun, Mexico) (Dror G. Feitelson and Larry Rudolph, eds.), Springer-
Verlag, May 2000, Lecture Notes in Computer Science Vol. 1911, pp. 87-112.
[MEB88] Shikharesh Majumdar, Derek L. Eager, and Richard B. Bunt, Scheduling in multipro-
grammed parallel systems, Proceedings of the ACM SIGMETRICS Conference on
Measurement and Modeling of Computer Systems (Santa Fe, New Mexico, United
States), May 1988, pp. 104-113.
[MG08] Katherine F. Moore and Dan Grossman, High-level small-step operational seman-
tics for transactions, Proceedings of the ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages (POPL) (New York, NY, USA), ACM, 2008.
[MH05] J. Eliot B. Moss and Antony L. Hosking, Nested transactional memory: Model and
preliminary architecture sketches, Proceedings of the Workshop on Synchronization
and Concurrency in Object-Oriented Languages (SCOOL) (San Diego, California),
October 2005.
[MH06] J. Eliot B. Moss and Antony L. Hosking, Nested transactional memory: Model and
architecture sketches, Science of Computer Programming, vol. 63, Elsevier, Dec
2006, pp. 186-201.
[MKHJ90] Eric Mohr, David A. Kranz, and Robert H. Halstead, Jr., Lazy task creation: A tech-
nique for increasing the granularity of parallel programs, IEEE Transactions on Par-
allel and Distributed Systems 2 (1990), 185-197.
[Mos85] J. Eliot B. Moss, Nested transactions: An approach to reliable distributed computing,
MIT Press, Cambridge, MA, USA, 1985.
[Mos06] J. Eliot B. Moss, Open nested transactions: Semantics and support, Proceedings of
the Workshop on Memory Performance Issues (Austin, Texas), Feb 2006.
[MPT93] Rajeev Motwani, Steven Phillips, and Eric Torng, Non-clairvoyant scheduling, Pro-
ceedings of the Symposium on Discrete Algorithms (SODA), 1993, pp. 422-431.
[MR95] Rajeev Motwani and Prabhakar Raghavan, Randomized algorithms, Cambridge Uni-
versity Press, Cambridge, England, June 1995.
[MSH+06] Virendra J. Marathe, Michael F. Spear, Christopher Heriot, Athul Acharya, David
Eisenstat, William N. Scherer III, and Michael L. Scott, Lowering the overhead of
nonblocking software transactional memory, Proceedings of the Workshop of Lan-
guages, Compilers, and Hardware Support for Transactional Computing (TRANS-
ACT), June 2006.
[MSS05] Virendra J. Marathe, William N. Scherer III, and Michael L. Scott, Adaptive software
transactional memory, Proceedings of the International Symposium on Distributed
Computing (DISC) (Cracow, Poland), Sep 2005, Earlier but expanded version avail-
able as TR 868, University of Rochester Computer Science Dept., May 2005.
[MVZ93] Cathy McCann, Raj Vaswani, and John Zahorjan, A dynamic processor allocation
policy for multiprogrammed shared-memory multiprocessors, ACM Transactions on
Computer Systems 11 (1993), no. 2, 146-178.
[NB99] Girija J. Narlikar and Guy E. Blelloch, Space-efficient scheduling of nested paral-
lelism, ACM Transactions on Programming Languages and Systems (TOPLAS) 21
(1999), no. 1, 138-173.
[NMAT+07] Yang Ni, Vijay Menon, Ali-Reza Adl-Tabatabai, Antony L. Hosking, Richard L.
Hudson, J. Eliot B. Moss, Bratin Saha, and Tatiana Shpeisman, Open nesting in soft-
ware transactional memory, Proceedings of ACM SIGPLAN Symposium on Princi-
ples and Practices of Parallel Programming (PPoPP), March 2007.
[NSS93] V. K. Naik, M. S. Squillante, and S. K. Setia, Performance analysis of job scheduling
policies in parallel supercomputing environments, Proceedings of the ACM/IEEE
conference on Supercomputing (Portland, Oregon), November 1993, pp. 824-833.
[Pap79] Christos H. Papadimitriou, The serializability of concurrent database updates, Jour-
nal of the ACM 26 (1979), no. 4, 631-653.
[Ree78] David Patrick Reed, Naming and synchronization in a decentralized computer sys-
tem, Tech. Report MIT/LCS/TR-205, Massachusetts Institute of Technology Labora-
tory for Computer Science, September 1978.
[Rei07] James Reinders, Intel threading building blocks: Outfitting C++ for multi-core pro-
cessor parallelism, O'Reilly, 2007.
[RSAU91] Larry Rudolph, Miriam Slivkin-Allalouf, and Eli Upfal, A simple load balancing
scheme for task allocation in parallel machines, Proceedings of the Annual ACM
Symposium on Parallel Algorithms and Architectures (SPAA) (Hilton Head, South
Carolina), July 1991, pp. 237-245.
[RSD+94] Emilia Rosti, Evgenia Smirni, Lawrence W. Dowdy, Giuseppe Serazzi, and Brian M.
Carlson, Robust partitioning schemes of multiprocessor systems, Performance Eval-
uation 19 (1994), no. 2-3, 141-165.
[RSSD95] Emilia Rosti, Evgenia Smirni, Giuseppe Serazzi, and Lawrence W. Dowdy, Analysis
of non-work-conserving processor partitioning policies, Proceedings of the Interna-
tional Parallel Processing Symposium (IPPS), 1995, pp. 165-181.
[RW08] R. Raman and D.S. Wise, Converting to and from dilated integers, IEEE Transactions
on Computers 57 (2008), no. 4, 567-573.
[SATG+09] Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Robert Geva, Yang Ni, and Adam Welc,
Towards transactional memory semantics for C++, Proceedings of the Symposium
on Parallelism in Algorithms and Architectures (SPAA) (Calgary, Canada), 2009.
[SATH+06] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and Ben-
jamin Hertzberg, McRT-STM: A high performance software transactional memory
system for a multi-core runtime, Proceedings of ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP), March 2006, pp. 187-197.
[Sco06] Michael L. Scott, Sequential specification of transactional memory semantics, Pro-
ceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Hardware
Support for Transactional Computing (TRANSACT), June 2006.
[SDMS08] Michael F. Spear, Luke Dalessandro, Virendra J. Marathe, and Michael L. Scott,
Ordering-based semantics for software transactional memory, Proceedings of the
International Conference on Principles of Distributed Systems (OPODIS) (Berlin,
Heidelberg), Springer-Verlag, 2008, pp. 275-294.
[Sen04] Siddhartha Sen, Dynamic processor allocation for adaptively parallel work-stealing
jobs, Master's thesis, Massachusetts Institute of Technology, Cambridge, MA,
September 2004.
[Sev94] K. C. Sevcik, Application scheduling and processor allocation in multiprogrammed
parallel processing systems, Performance Evaluation 19 (1994), no. 2-3, 107-140.
[Son98] Bin Song, Scheduling adaptively parallel jobs, Master's thesis, Massachusetts In-
stitute of Technology Department of Electrical Engineering and Computer Science,
Cambridge, Massachusetts, January 1998.
[Squ95] Mark S. Squillante, On the benefits and limitations of dynamic partitioning in parallel
computer systems, Proceedings of the Workshop on Job Scheduling Strategies for
Parallel Processing (JSSPP) (Santa Barbara, California) (Dror G. Feitelson and Larry
Rudolph, eds.), Springer-Verlag, April 1995, Lecture Notes in Computer Science Vol.
949, pp. 219-238.
[ST94] Dan Suciu and Val Tannen, Efficient compilation of high-level data parallel algo-
rithms, Proceedings of the ACM Symposium on Parallel Algorithms and Archite-
ctures (SPAA), 1994, pp. 57-66.
[Sup03] Supercomputing Technologies Group, MIT Laboratory for Computer Science,
Cilk 5.4.6 reference manual, http://supertech.csail.mit.edu/cilk/,
2003.
[Sup06] Supercomputing Technologies Group, Massachusetts Institute of Technology Labo-
ratory for Computer Science, Cilk 5.4.2.3 reference manual, April 2006.
[SW81] T. F. Smith and M. S. Waterman, Identification of common molecular subsequences,
Journal of Molecular Biology 147 (1981), 195-197.
[TBB96] Timothy B. Brecht and Kaushik Guha, Using parallel program characteristics in dynamic
processor allocation policies, Performance Evaluation 27-28 (1996), 519-539.
[TG89] Andrew Tucker and Anoop Gupta, Process control and scheduling issues for multi-
programmed shared-memory multiprocessors, Proceedings of the ACM Symposium
on Operating Systems Principles (SOSP) (Litchfield Park, Arizona), December 1989,
pp. 159-166.
[TLW94] John Turek, Walter Ludwig, Joel L. Wolf, Lisa Fleischer, Prasoon Tiwari, Jason
Glasgow, Uwe Schwiegelshohn, and Philip S. Yu, Scheduling parallelizable tasks
to minimize average response time, Proceedings of the ACM Symposium on Parallel
Algorithms and Architectures (SPAA), 1994, pp. 200-209.
[Tra83] I.L. Traiger, Trends in systems aspects of database management, International Con-
ference on Databases, Wiley Heyden Ltd, 1983, pp. 1-21.
[TSP92] John Turek, Dennis Shasha, and Sundeep Prakash, Locking without blocking: making
lock based concurrent data structure algorithms nonblocking, Proceedings of the
eleventh ACM SIGACT-SIGMOD-SIGART symposium on Principles of Database
Systems (PODS) (New York, NY, USA), ACM, 1992, pp. 212-222.
[Val08] Leslie G. Valiant, A bridging model for multi-core computing, Proceedings of the
European Symposium on Algorithms (ESA) (Berlin, Heidelberg), 2008, pp. 13-28.
[Wei86] Gerhard Weikum, A theoretical foundation of multi-level concurrency control, Pro-
ceedings of the ACM SIGACT-SIGMOD Symposium on Principles of database sys-
tems (PODS) (New York, NY, USA), ACM Press, 1986, pp. 31-43.
[WF99] David S. Wise and Jeremy D. Frens, Morton-order matrices deserve compilers' sup-
port, Tech. Report TR533, Indiana University, 1999.
[WS92] Gerhard Weikum and Hans-Jorg Schek, Concepts and applications of multilevel
transactions and open nested transactions, Database Transaction Models for Ad-
vanced Applications, Morgan Kaufmann, San Francisco, CA, USA, 1992, pp. 515-
553.
[YL01] K. K. Yue and D. J. Lilja, Implementing a dynamic processor allocation policy for
multiprogrammed parallel applications in the Solaris(TM) operating system, Concur-
rency and Computation: Practice and Experience 13 (2001), no. 6, 449-464.
[Zip49] George K. Zipf, Human behavior and the principle of least effort, Addison-Wesley,
1949.
