Consistent state software transactional memory by Cunha, Gonçalo Vasco Trincão Bento da
Universidade Nova de Lisboa
Faculdade de Ciências e Tecnologia
Departamento de Informática
Consistent State Software
Transactional Memory
Gonçalo Vasco Trincão Bento da Cunha
Dissertation presented at the Faculdade de
Ciências e Tecnologia of the Universidade
Nova de Lisboa to obtain the degree
of Master in Informatics Engineering.
Lisboa
(2007)
This dissertation was prepared under the supervision of
Professor João Lourenço,
of the Faculdade de Ciências e Tecnologia,
Universidade Nova de Lisboa.
To my mother, my father and my sister.
To Luzia.
Acknowledgements
First of all, I would like to express my appreciation to my supervisor, Professor João
Lourenço, for his guidance and support. Even with his very tight schedule, he
always found the time to discuss the problems and progress of the thesis. I also
appreciate his friendship and his remarkable human qualities.
To Dave Dice, Ori Shalev and Nir Shavit who sent us the TL2 source code, which is
the base of this thesis.
To my good friends Américo Rio and José Reis, with whom I have shared my ups
and downs during our usual coffee breaks and Friday lunches.
To my close family, my mother, father and sister, my very special thanks for their
love, support and confidence in me.
To Luzia, who always stood by me and gave me the strength to embrace this project.
Summary
As multicore CPUs reach everyone’s computers, concurrent programming must extend
beyond scientific and enterprise applications to every computer application we use
in our daily lives.
Since the introduction of software transactional memory [ST95, HLMWNS03], the
topic has attracted strong interest from the scientific community, because it has the
potential to greatly simplify concurrent programming by hiding concurrency issues
beneath the transactional layer.
This thesis builds on the TL2 STM engine [DON06], one of the top-performing engines
to date. We explored different design alternatives, focusing on performance and
safety. Our research improved both the performance and the safety properties of the
engine, and gave us a much better understanding of the design alternatives and
their impact.
During the course of this thesis we came across several tough concurrency bugs and
compiled a list of testing patterns, which proved useful in finding and reproducing
several problems.
This thesis describes the state of the art in STM engine technology, elaborates on the
design of a new STM engine and reports on the experimental results obtained.
Sumário
Now that multicore processors are becoming common even in personal computers,
concurrent programming needs to cover not only scientific and enterprise
applications but also all the programs we use in our daily lives.
Since the introduction of software transactional memory [ST95, HLMWNS03],
this topic has attracted strong interest from the scientific community. This is
because it has the potential to considerably ease the development of concurrent
programs, hiding their difficulties in the transactional layer.
This thesis is based on the TL2 transactional engine [DON06], one of the
best-performing research prototypes to date. During this study we explored different
architectures for TL2, focusing on execution speed and safety properties, and
achieved improvements over the original prototype on both counts. We also gained a
more detailed understanding of the implementation alternatives for transactional
engines, and of their impact on functionality and performance.
During the research we came across several complex concurrency problems, which
led us to synthesize a list of testing patterns that proved useful in finding even
more problems in the prototype.
This work presents some of the current topics in transactional memory research,
describes in detail the changes we made to TL2 and presents the experimental
results obtained.
Contents
1 Introduction
1.1 Introduction
1.1.1 Transactional Model
1.1.2 Transactional Memory
1.2 Contributions of this Thesis
1.3 Outline of the Dissertation
2 Software Transactional Memory
2.1 Introduction
2.2 Features
2.2.1 Conditional waiting
2.2.2 Composability
2.2.3 Composing alternatives in transactions
2.2.4 Transaction nesting
2.2.5 Irrevocable actions
2.3 Design approaches
2.3.1 Blocking vs. Non-Blocking Synchronization Techniques
2.3.2 Transactional data and granularity
2.3.3 Concurrency Control
2.3.4 Recovery Strategies
2.3.5 Transaction States and Global Version Clock Algorithm
2.3.6 Lock Placement
2.3.7 Contention Management
2.3.8 Support for Long Running Transactions
2.4 A Bit of History
3 Prototype Development
3.1 Introduction
3.2 Original TL2
3.2.1 Lock Structure and Placement
3.2.2 Word Mode
3.2.3 Redo Log Strategy with Consistent State Validation
3.2.4 Contention Management
3.3 TL2 Port to X86
3.3.1 Solaris to Linux
3.3.2 SPARC to X86
3.3.3 64 bit to 32 bit architecture
3.4 New Features
3.4.1 User Aborts
3.4.2 Automatic Transaction Retry
3.4.3 Transaction Nesting
3.5 Performance Related Changes
3.5.1 Undo Log Strategy with Consistent State Validation
3.5.2 Object Mode
3.5.3 Full Validation, Partial Validation and No Validation
3.5.4 Reducing TxLoad Overhead
3.5.5 Block Size
3.5.6 Lock Adjacent to Data
3.6 Safety Improvements
3.6.1 New Quiesce Algorithm
3.7 Debugging enhancements
3.7.1 Event Tracing
4 Testing STM Implementations
4.1 Introduction
4.2 Terminology
4.3 Sample of Bugs Found
4.3.1 Bug 1: Reference to Non-Transactional Memory
4.3.2 Bug 2: Lost Update with Lock Collision
4.3.3 Bug 3: Dirty Read Not Invalidated when Transaction Aborts
4.3.4 Bug 4: Lost Update on Lock Upgrade
4.4 Testing Patterns
4.4.1 Very Short Transactions
4.4.2 High Frequency of Variables Being Added and Deleted
4.4.3 High Number of Updates on a Small Number of Variables
4.4.4 Small Lock Table
4.4.5 More Concurrent Transactions than CPUs
4.5 Conclusions
5 Prototype Validation
5.1 Introduction
5.2 Description of the Tests
5.2.1 Test Harness Implementation
5.2.2 Test Parameters
5.3 Test Harness Overhead
5.4 Test Execution
5.4.1 Comparing Undo/Redo, Word/Object modes
5.4.2 The cost of consistent state validation
5.4.3 Lock adjacent to the data
5.4.4 Different Block Sizes
5.4.5 Comparing the performance against the ported TL2
5.4.6 Comparing the performance against Ennals STM
5.4.7 Cache coherency problems
5.4.8 STM engine overhead
5.4.9 Speedup Analysis
5.5 Conclusions
6 Conclusions
6.1 Conclusions
6.2 Future Work
A Raw test data
A.1 Keywords
A.2 Raw Data of the Red Black Tree Tests
A.3 Raw Data of the Sorted List Tests
B STM Engine API
B.1 STM engine data structures
B.2 API for transaction initiation and control
B.3 API for loading and storing data in word based mode
B.4 API for loading and storing data in object based mode
List of Figures
1.1 Simple Transaction
1.2 Transaction that may abort
2.1 Transaction defined by an atomic block
2.2 Transaction defined by a function or method
2.3 One-way Bridge using condition variables
2.4 One-way Bridge using atomic blocks with guards
2.5 One-way Bridge using atomic blocks with retry primitive
2.6 Composing alternatives in transactions
2.7 Irrevocable action inside transaction
2.8 DSTM Object Structure
2.9 Versioned write lock
2.10 Transaction sees inconsistent state.
2.11 Transaction in consistent but obsolete state.
2.12 Infinite loop caused by improper ordering. Transaction T1 is inconsistent.
2.13 Memory Corruption when accessing indexed data.
2.14 False positive with the global version clock algorithm.
2.15 Sample list implementation using adjacent locks
2.16 STM History
3.1 Lock Table in TL2
3.2 Simplified API of TL2
3.3 Consistent state validation in Redo Log/Word based mode
3.4 User specified abort
3.5 Control flow after abort
3.6 Non retrying transaction placed inside a while loop
3.7 Non retrying transaction placed inside a while loop
3.8 Simplified API for handling objects in object based mode
3.9 Full state validation
3.10 Partial state validation
3.11 No state validation
3.12 Consistent state validation for locked variables
3.13 Optimized TxLoad algorithm.
3.14 Redo log mode: Reference to Non-Transactional Memory.
3.15 Redo log mode: Reference to Non-Transactional Memory - Corrected version
3.16 Creating and releasing a transactional variable (simplified)
3.17 API and sample usage of the Tracing Engine
4.1 Transaction Construct Glossary - Transactional Level
4.2 Transaction Construct Glossary - Lock and Data Level
4.3 Simplified decomposition of transactional- into lock-level operations in undo- and redo log mode STMs
4.4 Undo/Redo log mode: Reference to Non-Transactional Memory.
4.5 Lock acquisition in redo log mode - buggy version
4.6 Redo log mode: Lost Update with Small Lock Table
4.7 Lock acquisition in redo log mode - correct version
4.8 Undo log mode: Dirty read not invalidated when transaction aborts
4.9 Undo log mode: Lost update on lock upgrade
5.1 Percentage of time spent in the harness
5.2 Test Harness Launcher Overhead
5.3 Evaluation of implementation alternatives - Red Black Tree
5.4 Evaluation of implementation alternatives - Sorted List
5.5 Consistent state validation alternatives - Red Black Tree
5.6 Abort Time
5.7 Cost of validation - Sorted List
5.8 Adjacent Lock vs Lock Table - Red Black Tree
5.9 Adjacent Lock vs Lock Table - Sorted List
5.10 Comparing different block sizes.
5.11 Ported TL2 vs improved prototype
5.12 Ennals vs improved prototype
5.13 High cache coherency traffic on the bus
5.14 STM engine overhead
5.15 STM engine speedup
A.1 Test results keywords
A.2 STM engine running modes
Chapter 1
Introduction
This Chapter shows the motivation for transactional memory, highlights the main con-
tributions of the thesis and outlines the structure of this work.
1.1 Introduction
Over the past decades, the density of transistors in computer chips has grown
exponentially. Back in 1965, Gordon Moore observed that the number of transistors
per chip was doubling at a regular pace, a period often quoted as 18 months, and the
rate of growth has remained relatively stable since then. For a long time the CPU
clock frequency followed this trend, growing side by side with the number of
transistors. However, since the appearance of CPUs with GHz clock speeds, the growth
of clock frequency has been slowing down, even though the number of transistors is
still rising.
This situation, in which the number of transistors increases but the clock speed does
not, has left room for CPU manufacturers to add more instruction set extensions
(e.g. SSE, VT) and to increase the number of cores per chip. However, regular
single-threaded applications cannot take advantage of the increasing number of cores,
so developers must change the way applications are implemented if they are to keep
providing more complex features without degrading overall performance.
To take advantage of this new reality, applications must be concurrent. There is,
however, a natural resistance from many programmers to writing concurrent programs
because of the additional effort and complexity involved [Sut05]. This resistance is
due in part to the lack of a suitable framework for dealing with concurrency. The
usual synchronization constructs (locks and condition variables), while simple on
paper, may become unpredictable and error prone in complex systems. Coarse-grained
locks on large data structures do not scale, and fine-grained locks are prone to
difficult problems in larger systems, such as deadlocks and priority inversion.
Moreover, locks and condition variables do not compose [HMPJH05]. Software
Transactional Memory (STM) is gaining popularity because it helps to solve some of
these problems: it can prevent deadlocks and priority inversion, and it is composable.
1.1.1 Transactional Model
Transactions have long been known in the database community. They have been
very well studied, have standard semantics and are a standard feature of database
management systems.
The transactional model is a good paradigm for dealing with concurrency because it
can hide most of its complexity. With the transactional model, programmers only
have to define which sections of their code are transactions. From a developer’s
perspective, the transactional model ensures that each such section runs “as if” it
were running alone in the system. The transactional layer is responsible for managing
the concurrency of the several transactions and for avoiding concurrency hazards.
The transactional model defines four properties [Dat94]:
• Atomicity. Transactions are atomic: they have an “all or nothing” effect. Either
all the actions of a transaction take effect or none of them does.
• Consistency. Transactions preserve consistency, meaning that the global state
moves from one consistent state to another.
• Isolation. Transactions are isolated from each other, even if they run concurrently.
For any two distinct transactions T1 and T2, T1 may see T2’s updates, or T2 may
see T1’s updates, but not both at the same time.
• Durability. After a transaction commits, its changes persist even if there is a
system crash.
A transactional system with these properties allows programmers to keep using
the sequential programming mental model with minimal changes. They largely do not
have to care about concurrency control, as the transactional model deals with it.
A sample transaction is shown in Figure 1.1. The transaction occurs between
BeginTransaction and Commit. After the commit returns successfully, the application
knows that all its effects have taken place. Before the commit occurs, the transaction
may still be undone.
If, instead of committing, the transaction had aborted (Figure 1.2), the application
would know that none of its updates had taken place.
    BeginTransaction()
    DoAction1()
    DoAction2()
    DoAction3()
    CommitTransaction()
Figure 1.1: Simple Transaction
    BeginTransaction()
    DoAction1()
    DoAction2()
    DoAction3()
    if (something)
        CommitTransaction()
    else
        AbortTransaction()
Figure 1.2: Transaction that may abort
The transactional model ensures that: either all or none of DoAction1, DoAction2 and
DoAction3 are performed; the consistency properties of the shared state are preserved
after the transaction finishes; even if other transactions run concurrently, the
transaction runs as if it were alone in the system; and, after the transaction commits,
its updates persist even if there is a system failure, in other words, the changes are
written to persistent storage.
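The all-or-nothing guarantee can be illustrated with a small single-threaded sketch in C. It is only an illustration under stated assumptions, not the mechanism of any particular engine: the names Tx, tx_begin, tx_store, tx_commit and tx_abort are hypothetical, and the sketch uses a simple undo log that remembers the old values so that an abort can restore them.

```c
#include <assert.h>
#include <stddef.h>

#define TX_MAX_WRITES 16

/* One undo-log entry: where we wrote and what was there before. */
typedef struct {
    int *addr;
    int  old_value;
} UndoEntry;

/* Hypothetical transaction descriptor (illustration only). */
typedef struct {
    UndoEntry log[TX_MAX_WRITES];
    size_t    count;
} Tx;

static void tx_begin(Tx *tx) { tx->count = 0; }

/* Write in place, remembering the previous value first. */
static void tx_store(Tx *tx, int *addr, int value) {
    assert(tx->count < TX_MAX_WRITES);
    tx->log[tx->count].addr = addr;
    tx->log[tx->count].old_value = *addr;
    tx->count++;
    *addr = value;
}

/* Commit: the writes are already in place, so just drop the log. */
static void tx_commit(Tx *tx) { tx->count = 0; }

/* Abort: undo the writes in reverse order so that none of them remains. */
static void tx_abort(Tx *tx) {
    while (tx->count > 0) {
        tx->count--;
        *tx->log[tx->count].addr = tx->log[tx->count].old_value;
    }
}
```

Undoing in reverse order matters when the same location is written more than once: the oldest logged value is restored last, so the location ends up exactly as it was before the transaction began.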
1.1.2 Transactional Memory
Transactional memory aims to replace the traditional concurrency control mechanisms
used in most programming languages, of which locks, condition variables and
semaphores are among the most common.
Naturally, the requirements of transactional memory and of databases differ
[HMPJH05]: for example, the consistency and durability properties are relaxed, and
memory transactions face a much stronger performance requirement.
The durability requirement implies that changes to the shared state must be
persistent after the transaction commits, meaning they have to be written to
persistent storage. This requirement is stronger than necessary for transactional
memory (locks and condition variables do not have this property either) and it
carries a large performance penalty, so it has been removed from the transactional
memory requirements.
The consistency requirement is well suited to databases, which have a standard,
well-known data organization (tables, rows, cells, etc.) and standard integrity
rules, such as referential integrity. Memory transactions, on the other hand, cannot
have consistency verified in the same way as database transactions because the data
structures are much more flexible: the concepts of primary and foreign keys are not
used in memory structures. This requirement has been dropped from all the STM
engines we know of.
To summarize, the main requirements for transactional memory are: Atomicity,
Isolation and Performance.
The need for performance has already led some authors to propose dropping some
interesting features. Early STM engines [HLMWNS03, HF03] were implemented using
non-blocking synchronization [HLM03] and, as a consequence, had the additional
feature of tolerating arbitrary thread failures. However, later tests with lock-based
STMs [Enn06, DS06, DON06, SATH+06] showed that the non-blocking engines performed
worse, so more and more STM proposals are based on blocking techniques.
The performance requirement has also led to the use of optimistic concurrency
control in preference to read/write locks. Optimistic concurrency control lets
transactions run without synchronization and, on commit, validates whether the
transaction conflicted with others. However, transactions running under optimistic
concurrency control may exhibit strange behaviors caused by conflicts with other
transactions (such as dereferencing invalid pointers), which must be detected and
handled by the STM engine.
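The idea of running without synchronization and validating on commit can be sketched in C as follows. This is a generic version-validation scheme written for illustration, not the TL2 algorithm itself: the types VersionedVar and OptTx and the functions opt_begin, opt_read and opt_commit are hypothetical, and the sketch is deliberately single-threaded, whereas a real engine would perform the validation and the update under locks or with atomic operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_READS 8

/* A shared variable paired with a version number that is bumped on
 * every committed update (hypothetical layout, for illustration). */
typedef struct {
    int      value;
    unsigned version;
} VersionedVar;

/* Read-set entry: which variable was read and the version seen. */
typedef struct {
    const VersionedVar *var;
    unsigned            seen_version;
} ReadEntry;

typedef struct {
    ReadEntry reads[MAX_READS];
    size_t    nreads;
} OptTx;

static void opt_begin(OptTx *tx) { tx->nreads = 0; }

/* Optimistic read: no lock is taken, only the version is remembered. */
static int opt_read(OptTx *tx, const VersionedVar *v) {
    assert(tx->nreads < MAX_READS);
    tx->reads[tx->nreads].var = v;
    tx->reads[tx->nreads].seen_version = v->version;
    tx->nreads++;
    return v->value;
}

/* Commit-time validation: succeed only if nothing that was read has
 * been updated since. On success, publish one write and bump the
 * version of the written variable. */
static bool opt_commit(OptTx *tx, VersionedVar *target, int new_value) {
    for (size_t i = 0; i < tx->nreads; i++) {
        if (tx->reads[i].var->version != tx->reads[i].seen_version)
            return false;   /* conflict detected: the caller must retry */
    }
    target->value = new_value;
    target->version++;
    return true;
}
```

If two transactions both read a variable and then both try to commit an increment, the first commit succeeds and bumps the version, so the second fails validation instead of silently overwriting the first update.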
This work is based on the TL2 STM engine. TL2 is a top-performing engine with
interesting features, including integration with languages without garbage
collection, such as C, and the early detection of inconsistent states.
1.2 Contributions of this Thesis
The focus of this thesis is STM engine technology. After an initial survey of the
existing STM engines, we extended TL2 by implementing new features, implemented
and evaluated several performance-oriented design options, and created testing tools
that make it easier to find problems in the STM engine.
The features implemented in the STM engine can be summarized as:
• User Called Aborts: allow the application to abort a transaction programmatically.
• Automatic Transaction Retry: automatically retry a transaction that finds itself
in an inconsistent state. This changes the semantics of the commit from “at most
one” to “exactly one” successful execution.
• Transaction Nesting: enable the partial rollback of a transaction and increase
the reusability of the software stack by allowing transactions to be started at
several layers.
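The effect of automatic retry on the commit semantics can be pictured as a loop around the transaction body. The sketch below is a minimal illustration under stated assumptions and is not the API of TL2 or of the prototype: TxBody, run_with_retry and flaky_increment are hypothetical names, the body reports an abort simply by returning false, and the loop is unbounded, whereas a real engine would typically add some form of contention management between attempts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A transaction body returns true if it committed and false if it was
 * aborted (hypothetical convention for this sketch). */
typedef bool (*TxBody)(void *arg);

/* Re-run the body until it commits, so that the caller observes exactly
 * one successful execution. Returns the number of attempts made. */
static int run_with_retry(TxBody body, void *arg) {
    assert(body != NULL);
    int attempts = 1;
    while (!body(arg))
        attempts++;
    return attempts;
}

/* Example body that simulates conflicts.
 * state[0]: number of simulated conflicts still to come
 * state[1]: the counter the transaction wants to increment */
static bool flaky_increment(void *arg) {
    int *state = arg;
    if (state[0] > 0) {
        state[0]--;
        return false;   /* simulated abort: nothing is applied */
    }
    state[1]++;
    return true;        /* committed */
}
```

With two simulated conflicts the body runs three times, but the counter is incremented only once, which is the "exactly one successful execution" behavior described above.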
Regarding performance, we have made several changes to the original TL2, namely:
• We have achieved better performance by coarsening the lock granularity to the
object level, which reduces overhead because fewer locks are acquired.
• We have further decreased the overhead by reducing the number of instructions
executed in the most frequently used transactional method (load).
• We have introduced a novel modification to TL2’s algorithm that decreases the
cache coherency overhead by reducing the number of accesses to shared state.
Other contributions of this thesis were:
• We have improved the safety properties of the original STM engine. This novel
modification ensures that transactional code always sees a consistent view of
memory. The original prototype did not ensure a consistent view when a piece
of allocated memory was leaving the transactional space (being freed).
• We have built a lightweight tracing engine. A specialized lightweight tracing
engine is indispensable for debugging the STM implementation because the lock
granularity of a memory transaction is very fine and the number of interleavings
is huge, making it virtually impossible to find STM implementation bugs with
traditional debuggers. The novel tracing engine built in this thesis records
events in order of occurrence, and does so with an exclusive section consisting
of a single CPU instruction.
• In addition to the changes made to the prototype, we have proposed a novel
classification scheme for the consistency of a transaction’s view of memory.
• Finally, we are also the first to present a type of hazard that may occur in
lock-based STMs that use the undo log strategy. This hazard arises because
transactional writes may take place while a transaction is in an invalid state.
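One way to record trace events in order with an exclusive section of a single instruction is to reserve a buffer slot with an atomic fetch-and-add, which on x86 typically compiles to one locked instruction. The C11 sketch below is an assumption about how such an engine could look, not the tracing engine actually built in this thesis: TraceEvent, trace_buf, trace_next and trace_record are hypothetical names, and events are simply dropped once the buffer is full.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TRACE_CAPACITY 1024

static_assert(TRACE_CAPACITY > 0, "trace buffer must not be empty");

/* Hypothetical event record; a real tracing API may differ. */
typedef struct {
    int thread_id;
    int event_code;
} TraceEvent;

static TraceEvent    trace_buf[TRACE_CAPACITY];
static atomic_size_t trace_next = 0;   /* index of the next free slot */

/* Reserve a slot with one atomic fetch-and-add, then fill it in outside
 * any critical section. Returns the slot index, or TRACE_CAPACITY when
 * the buffer is full and the event is dropped. */
static size_t trace_record(int thread_id, int event_code) {
    size_t slot = atomic_fetch_add(&trace_next, 1);
    if (slot >= TRACE_CAPACITY)
        return TRACE_CAPACITY;
    trace_buf[slot].thread_id = thread_id;
    trace_buf[slot].event_code = event_code;
    return slot;
}
```

The slot indices give the order in which threads reserved their entries. In this sketch a slot may be reserved slightly before its fields are written, so the buffer should be read only after the traced threads have stopped.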
1.3 Outline of the Dissertation
This dissertation is divided into the following chapters:
• Chapter 1. This Chapter shows the motivation for the use of transactional
memory, highlights the main contributions of the work and outlines the
structure of this thesis.
• Chapter 2. This Chapter presents the main features associated with transactional
memory along with the main design approaches.
• Chapter 3. This Chapter describes the TL2 STM engine, on which this work is
based. It also describes in detail the improvements we made to TL2, which gave
rise to this thesis.
• Chapter 4. This Chapter describes some of the problems found while implement-
ing the changes in the STM engine and synthesizes the tests used to reproduce
those problems.
• Chapter 5. This Chapter describes the tests and experiments made to evaluate the
prototype and shows the achieved results.
• Chapter 6. This Chapter summarizes the main results of this investigation and
points out some directions for future work.
Chapter 2
Software Transactional Memory
This Chapter presents the main features associated with transactional memory along
with the main design approaches.
2.1 Introduction
Transactional memory aims to provide a higher level of abstraction for concurrent
programming. It aims to replace locks and condition variables as the concurrency
control mechanisms, providing semantics that are easier for programmers to
reason about. One of the most important properties provided by transactional
memory is the serializable isolation level, under which concurrent events occur
“as if” they had happened sequentially. Another very important property is atomicity,
which means that either all the effects of a transaction are visible (the transaction
commits) or none of them are (the transaction aborts).
Taking these properties as a starting point, there has been a flurry of design
approaches and proposed new features. In the following sections we describe some
of the new features proposed in the literature and outline several design
approaches, discussing their main advantages and drawbacks.
Several syntaxes have been used in STM prototypes. The first approach, used in
[HLMWNS03], relies on methods defined in a library to control the progress of the
transaction (Figure 1.1 of Section 1.1.1). The second approach, used in [HMPJH05],
introduces a new keyword, atomic, to delimit a transaction; the authors call it an
atomic block (Figure 2.1). The third approach, used in [HLM06], places the transaction
code inside a function or method, which is then called via a transaction
manager (Figure 2.2).[1]
    atomic {
        x++;
        y++;
    }
Figure 2.1: Transaction defined by an atomic block
Regardless of the syntax used, the guarantee provided is always the same: if the
transaction returns successfully, then all of its actions were executed and it did
not interfere with other transactions. Otherwise, if the transaction did not return
successfully, none of its actions has taken effect.
[1] The examples shown in this thesis use pseudo-code in a C/Java-like style.
    int TransactionXPTO() {
        a = TxLoad(x);
        b = TxLoad(y);
        return 1;
    }

    // ...
    CallTransaction(TransactionXPTO);
Figure 2.2: Transaction defined by a function or method
2.2 Features
In addition to these basic features (atomicity and isolation), others have been
proposed to ease the developer’s job and to extend the range of situations in which
transactional memory can be used.
2.2.1 Conditional waiting
Conditional waiting is available when using condition variables: it allows a thread
to wait until a certain condition is true. The same idea can be applied to
transactions [HMPJH05].
One example is crossing a one-way bridge that allows only one car at a time.
Figure 2.3 shows this example implemented with condition variables.
    Init() {
        NumCars = 0;
    }

    Synchronized EnterBridge() {
        while (NumCars >= 1) {
            wait();
        }
        NumCars++;
    }

    Synchronized LeaveBridge() {
        NumCars--;
        NotifyAll();
    }
Figure 2.3: One-way Bridge using condition variables
Plain transactions are not enough to implement conditional waiting efficiently, as they
have no waiting primitives. Encoding this logic with transactions therefore requires a
waiting primitive. Two proposals have been made: guarded atomic blocks (Figure 2.4),
used in [HF03], and a new retry primitive (Figure 2.5), used in [HMPJH05].
When using a guarded atomic block, the execution is deferred until the condition
is true. When using the retry primitive, the transaction executes as usual, but when
retry is called the transaction is rolled back and restarted only when one of the variables
read in its previous execution has changed. In the example of Figure 2.5, if a transaction
retries, it will be restarted only when the variable NumCars has changed (by a commit of
another transaction). The retry primitive is more flexible than guarded atomic blocks
because the logic executed before a retry can be more complex than a simple expression.
For example, it is not easy to use atomic blocks with guards to wait for all the elements
of an array to have a specific value.
    Init() {
        atomic {
            NumCars = 0;
        }
    }

    EnterBridge() {
        atomic (NumCars < 1) {
            NumCars++;
        }
    }

    LeaveBridge() {
        atomic {
            NumCars--;
        }
    }
Figure 2.4: One-way Bridge using atomic blocks with guards

    Init() {
        atomic {
            NumCars = 0;
        }
    }

    EnterBridge() {
        atomic {
            if (NumCars < 1)
                NumCars++;
            else
                retry;
        }
    }

    LeaveBridge() {
        atomic {
            NumCars--;
        }
    }
Figure 2.5: One-way Bridge using atomic blocks with retry primitive
2.2.2 Composability
Composability is an important property because it facilitates the creation of several
software layers, where the lower layers hide the complexity of their implementation
from the upper layers.
An example where locks are not composable is a concurrent list implementation
with three access methods: add, delete and get [HMPJH05]. These methods are internally
synchronized and are therefore sufficient to manage a single list. However, if an
application has two list instances and needs to move an element from one list to the other
without exposing the intermediate state in which the element is in neither list, these
three methods are not enough. The list implementation must expose its internal lock and
unlock methods to allow this operation, which breaks the abstraction.
Memory transactions can help achieve better composability by using the extensions
described ahead.
2.2.3 Composing alternatives in transactions
Some situations require a way to express alternatives in transactions. Let’s say there
is already a library representing a transactional queue and a given thread can use the
send/receive API calls to interact with the queue. The receive method blocks when the
queue is empty.
Let’s suppose one given thread needs to wait for data on two or more transactional
queues. The problem is that since the receive method is blocking, if one queue is empty,
the thread will block and will not be able to receive from the other queue even if it has
data.
The basic requirements for this feature are: the ability to wait for several events;
the absence of active polling; and composability, as we want to reuse the existing
receive method from the transactional library API.
The solution proposed in [HMPJH05] is a new primitive, called orElse, that expresses
an alternative. It is used in conjunction with the retry primitive and allows the
transaction to execute an alternative path if the previous path issued a retry. In the
example of Figure 2.6, the method ReceiveFromEither tries to receive from one of the
queues: if the first receive issues a retry, its effects are rolled back and the second
queue is tried. If the second queue is also empty, the whole transaction is rolled back
and restarted when any of the variables read has changed.
    Receive(Queue q) {
        atomic {
            if (q.NumItems == 0) {
                retry;
            } else {
                // return data from queue
            }
        }
    }

    ReceiveFromEither(Queue q1, Queue q2) {
        atomic {
            do {
                d = Receive(q1);
            } orElse {
                d = Receive(q2);
            }
        }
    }
Figure 2.6: Composing alternatives in transactions
2.2.4 Transaction nesting
A feature long known in databases is transaction nesting, which enables the creation
of sub-transactions (transactions inside other transactions).
Nesting is essential to enable the composition of transactions. When a software
system is composed of several layers, the lower layers need to be atomic, and the upper
layers need to reuse the operations from the lower layers, then the transactions need to
be composable. In other words, the upper layers are transactions and need to reuse the
lower layer operations, which are also transactions.
An example is a list implementation with the same three methods: add, delete and
get. The operation move, which transfers an element from one list to another, should
be composed of a delete followed by an add, but in an atomic step. As such, it needs
to start a transaction and, when it calls the methods delete and add, they also start a
(sub-)transaction.
There are three types of transaction nesting [ALS06]: flat nesting, closed nesting and
open nesting. Their main characteristics are:
• Flat Nesting: The effects of the sub-transactions are only visible to other
transactions when the main transaction commits. This mode works by “inlining” the
sub-transactions with the main transaction. When a sub-transaction aborts, the
whole transaction is rolled back, including the main transaction.
• Closed Nesting: The effects of the sub-transactions are also only visible to other
transactions when the main transaction commits. In this mode, the updates of
the sub-transactions are recorded separately from the main transaction and they
only become part of the main transaction when the sub-transaction commits
successfully. When a sub-transaction aborts, only the effects of the sub-transaction
are undone.
• Open Nesting: The sub-transactions are considered to be independent of the
main transaction and their effects are visible to other transactions immediately
after the sub-transaction commits successfully. An abort of a sub-transaction does
not cause the main transaction to abort and, even if the main transaction aborts,
the effects of the sub-transaction will not be undone.
2.2.5 Irrevocable actions
Since transactions may abort at any time (by an explicit call from the application or due
to internal engine issues), it is important that every action performed within a
transaction is reversible. However, some types of actions are not reversible, such as
writing to a socket or deleting a file, and consequently they should not be allowed. STM
engines to date are only capable of reversing memory operations. I/O operations must
therefore not be used inside a transaction; otherwise the operation may be executed
multiple times, or it may take effect even if the transaction aborts.
In the example of Figure 2.7, the PowerOffRemoteServer function powers off a
server on the network. If the execution were allowed and “some condition” were
false, the transaction would roll back, but the remote server would already have been
shut down by the transaction.
    TransactionBegin();
    PowerOffRemoteServer(ServerName);
    if (some condition)
        TransactionCommit();
    else
        TransactionAbort();
Figure 2.7: Irrevocable action inside transaction
It is not easy for an STM engine to prevent I/O operations (in some programming
languages it may even be impossible), which is why most STM engines leave this
responsibility to the programmer. One exception is STM Haskell [HMPJH05]. It
is implemented in Concurrent Haskell, which has a very expressive type system and
splits all actions into STM actions and IO actions. The compiler statically prevents IO
actions from running inside memory transactions.
2.3 Design approaches
We now describe some concepts and design approaches that have been reported in
the literature. These are some of the fundamental building blocks of STM engines:
the synchronization techniques used, the granularity of the locks, the algorithms of the
transaction log, the contention management strategies, etc.
2.3.1 Blocking vs. Non-Blocking Synchronization Techniques
When a thread executes a transactional access (read or write), it uses the primitives of
the STM engine to mediate the access. The access must be mediated by the engine to
allow the synchronization of the concurrent accesses to shared data-structures.
The synchronization techniques used by STM engines may be blocking or
non-blocking. Traditional blocking (lock-based) synchronization usually has better
performance. The alternative, non-blocking synchronization, guarantees progress of
the system as a whole independently of the interleavings or of any thread failures.
Blocking Synchronization
Blocking synchronization techniques use locks to prevent other threads from accessing
the same data. Locks are acquired before any access to the shared data and released after
the operation is complete. Threads that find a locked item must wait until the lock
is released to access the data. Contention management algorithms (described ahead) are
usually used to avoid deadlock and starvation problems. However, if a thread
owning a lock fails, the lock can no longer be (safely) released and the system may
freeze.
Non-Blocking Synchronization
Non-blocking synchronization techniques work by making a private working copy of
the items and, when the changes are finished, atomically swapping the working copy with
the old one. The operations must be restartable because threads may collide with each
other when accessing the data.
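The copy-and-swap pattern can be sketched in Java with a compare-and-set loop. This is a minimal illustration under stated assumptions, not an STM implementation: the AccountSnapshot and NonBlockingAccount types are hypothetical, and a single reference stands in for the shared item.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable snapshot of the shared data (hypothetical example type).
final class AccountSnapshot {
    final int balance;
    AccountSnapshot(int balance) { this.balance = balance; }
}

class NonBlockingAccount {
    private final AtomicReference<AccountSnapshot> current =
            new AtomicReference<>(new AccountSnapshot(0));

    // Copy, modify privately, then try to swap; restart on collision.
    void deposit(int amount) {
        while (true) {
            AccountSnapshot old = current.get();               // read current version
            AccountSnapshot copy =
                    new AccountSnapshot(old.balance + amount); // private working copy
            if (current.compareAndSet(old, copy)) {            // atomic swap
                return;
            }
            // Another thread swapped first: discard the copy and retry.
        }
    }

    int balance() {
        return current.get().balance;
    }
}
```

No thread ever holds a lock, so a thread that stalls in the middle of deposit cannot prevent the others from completing.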
STMs using non-blocking synchronization inherit the advantages of this technique:
they do not suffer from deadlocks and are resilient to thread failures. If a thread fails
within a transaction, its effects are discarded without affecting the global state. In
contrast, with blocking STMs, if there is an unforeseen exception or thread failure
while holding locks, then either the locks stay acquired, halting the system, or the
locks are released by a manager and the dirty data becomes visible to the other threads.
Non-blocking synchronization algorithms can offer one of three progress
guarantees [HLM03, HF]:
• Obstruction freedom “is the weakest guarantee: a thread performing an operation
on the data structure is only guaranteed to make progress so long as it does not
contend with other threads for access to any location (. . . ). This requires an out-
of-band mechanism to avoid livelock; exponential backoff is one option.”
• Lock freedom ensures that at least one thread always makes progress hence “the
system as a whole makes progress, even if there is contention. (. . . ) This is suffi-
cient to prevent livelock, although it does not offer any guarantee of per-thread
fairness.”
• Wait freedom “adds the requirement that every thread makes progress, even if
it experiences contention.” “It ensures that every thread will continue to make
progress in the face of arbitrary delay (or even failure) of other threads.”
A wait free algorithm is always lock free and a lock free algorithm is always obstruction
free, but not the other way around. Independently of the progress guarantees, non-
blocking synchronization mechanisms are never subject to deadlocks.
Figure 2.8 shows an example of how DSTM, a non-blocking STM implementation
[HLMWNS03], atomically updates objects on commit. Every transactional
object is a pointer to a Locator object. The Locator holds three pointers: one to the
status of the transaction updating the object, one to an old copy of the object data and
one to a new copy. While the transaction is active, the data under the old pointer is the
most recently committed data and the new pointer refers to the working copy. When the
transaction commits, it updates the status from active to committed, and from this moment
on the most recently committed data is under the new pointer.
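The locator indirection described above can be sketched as follows. This is a simplified illustration, not the DSTM API: all class and method names are assumptions, the working copy is passed in by the caller instead of being cloned, and a conflicting writer simply fails where a real engine would consult its contention manager.

```java
import java.util.concurrent.atomic.AtomicReference;

enum DstmStatus { ACTIVE, COMMITTED, ABORTED }

// Transaction descriptor: its status is the single word changed on commit.
class DstmTransaction {
    final AtomicReference<DstmStatus> status =
            new AtomicReference<>(DstmStatus.ACTIVE);

    boolean commit() {
        return status.compareAndSet(DstmStatus.ACTIVE, DstmStatus.COMMITTED);
    }

    boolean abort() {
        return status.compareAndSet(DstmStatus.ACTIVE, DstmStatus.ABORTED);
    }
}

// Immutable locator: owner transaction plus old and new copies of the data.
final class Locator<T> {
    final DstmTransaction owner;
    final T oldData;
    final T newData;

    Locator(DstmTransaction owner, T oldData, T newData) {
        this.owner = owner;
        this.oldData = oldData;
        this.newData = newData;
    }

    // Most recently committed data, as decided by the owner's status.
    T committed() {
        return owner.status.get() == DstmStatus.COMMITTED ? newData : oldData;
    }
}

class TransactionalObject<T> {
    private final AtomicReference<Locator<T>> locator;

    TransactionalObject(T initial) {
        DstmTransaction init = new DstmTransaction();
        init.commit();
        locator = new AtomicReference<Locator<T>>(
                new Locator<T>(init, null, initial));
    }

    // Installs a locator owned by tx; fails if another active transaction owns the object.
    boolean openForWrite(DstmTransaction tx, T workingCopy) {
        Locator<T> cur = locator.get();
        if (cur.owner != tx && cur.owner.status.get() == DstmStatus.ACTIVE) {
            return false;
        }
        Locator<T> next = new Locator<T>(tx, cur.committed(), workingCopy);
        return locator.compareAndSet(cur, next);
    }

    T readCommitted() {
        return locator.get().committed();
    }
}
```

Committing is a single compare-and-set on the status field, which switches every object opened by the transaction from its old copy to its new copy at once.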
[Figure: a Transaction Object points to a Locator, which holds three pointers: Transaction (to the TX State), New Object (to Data) and Old Object (to Data).]
Figure 2.8: DSTM Object Structure
Synchronization Techniques on STMs
In the context of transactions, there are arguments in favor of both blocking and
non-blocking approaches [HLMWNS03, Enn06, DS06]. However, there has been a shift
towards blocking STMs. Early STM implementations were lock-free [ST95]; later,
obstruction-free implementations [HLMWNS03] were proposed; and most recently,
implementations using blocking synchronization [Enn06, DS06, DON06, SATH+06].
This shift is due to the search for simpler implementations and better performance,
and blocking STMs are closer to those goals [Enn06, DS06].
2.3.2 Transactional data and granularity
It is important that the variables read and written by transactions are not
simultaneously accessible by non-transactional code. Otherwise, harmful interleavings
could occur due to unsynchronized access to these variables.
We have seen two strategies for dealing with this issue. The first is to leave to the
programmer the responsibility of avoiding non-transactional accesses to transactional
variables; this is the strategy of [Enn06, DS06, DON06]. The second is to define a special
type of transactional variable, which may only be accessed by transactions; this is the
strategy of [HLMWNS03, HLM06].
Another issue when defining transactional data is its granularity. In
other words, the engine needs to define the smallest amount of data a transaction may
access. The main strategies are block level (word-based STMs) and object level
(object-based STMs). With block level granularity, the transactional data grain has the
size of a memory word or of a cache line. With object level granularity, the
transactional data grain has the size of an object.
Word based
Word based STMs were first proposed by Harris and Fraser in [HF03]. These
STMs have more room for concurrency as the granularity is finer. In lock based STMs
each word has its own lock and each lock is acquired independently of the others,
so collisions are less frequent. However, the overhead is high because the
transactional API must be called once for every word the transaction accesses.
For instance, a transaction that needs to access all the fields of a forty byte object
(assuming a word size of four bytes) must make ten transactional calls to read the
whole object.
Object based
Object based STMs [HLMWNS03] have one metadata structure shared among all the
fields of the object, so they have a coarser granularity than word based STMs.
Simultaneous accesses from different transactions to different fields of the same object
will conflict. The transactional API is called once for every object (instead of every field).
Object based STMs are less suitable for accesses to transactional variables that are
not objects, such as single variables and arrays. Word based STMs, on the other hand,
can access single variables and access arrays word by word.
2.3.3 Concurrency Control
There are two main types of concurrency control mechanisms used in blocking STM
engines [SATH+06]: read-write locks and versioned write locks. Read write locks are
one of the most commonly used concurrency control methods and there are
implementations available for most programming languages (e.g., pthread_rwlock for
C; ReentrantReadWriteLock for Java).
Versioned write locks, on the other hand, use optimistic concurrency control
for reads and revocable two phase locking for writes. Their main advantage is that
reads do not need to acquire a lock, and therefore do not need to write to the shared
lock. This causes less cache coherency traffic on the shared bus and fewer remote
cache invalidations (hence fewer cache misses).
Read Write Locks
Read write locks have two types of lock acquisition, one for reading and another for
writing. Readers acquire the lock in shared mode before reading the value; writers
acquire the lock in exclusive mode before writing the value. A read lock can be shared
with other readers, while a write lock is exclusive. The main disadvantage of this method
is that it needs to acquire a lock when reading, which causes extra cache invalidation
traffic on the bus. In addition, upgrades from read to write locks are expensive
[SATH+06], as the writers need to wait for the readers to finish.
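As a minimal sketch of these two acquisition modes, the following Java class guards a single value with the standard ReentrantReadWriteLock. The RwLockedCell class and its method names are illustrative assumptions; the last method only demonstrates that, with this particular lock, a write lock cannot be granted while a read lock is still held.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// A single shared value guarded by a read write lock (hypothetical example type).
class RwLockedCell {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private int value;

    int read() {
        lock.readLock().lock();          // shared mode: still a write to the lock state
        try {
            return value;
        } finally {
            lock.readLock().unlock();
        }
    }

    void write(int newValue) {
        lock.writeLock().lock();         // exclusive mode
        try {
            value = newValue;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Tries to take the write lock while holding a read lock.
    // ReentrantReadWriteLock does not support upgrading, so this returns false.
    boolean tryUpgradeWhileReading() {
        lock.readLock().lock();
        try {
            boolean upgraded = lock.writeLock().tryLock();
            if (upgraded) {
                lock.writeLock().unlock();
            }
            return upgraded;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Even the read path modifies the shared lock state, which is the source of the extra cache invalidation traffic mentioned above.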
Versioned Write Locks
For performance reasons, most lock based STM systems use optimistic concurrency
control for reads with versioned write locks [Enn06, DS06, DON06, SATH+06]. Each
variable has a version number that is incremented for every commit updating it. When
a transactional read is performed, the version number of the variable is recorded in a
transaction local data structure (the read set). On commit, the engine checks whether
the versions of the variables read have changed by comparing the current version number
with the one recorded at the transactional read. If any of the version numbers has changed,
the transaction is in an invalid state and must be rolled back; otherwise it may commit.
This prevents non-serializable orderings from committing.
    Version/Address                                  Lockbit
    Lock Version                                     0
    Owner Address (pointer to the Write Descriptor)  1
Figure 2.9: Versioned write lock
A versioned write lock is usually implemented using a single word (32 or 64 bits,
depending on the architecture) that holds a field indicating whether the lock is held,
together with either the version number or a pointer to the transaction holding the
lock (Figure 2.9).
The field (lockbit) indicating the status of the lock (acquired or not) uses a single bit.
The remaining bits (31 or 63, depending on the architecture) hold either the lock version
number (if the lock is released) or a pointer to a descriptor of the transaction
holding the lock. The performance advantage of versioned write locks comes from the
fact that readers just read the lock value; they do not have to change the lock status (as
happens with read write locks).
The use of versioned write locks makes reads invisible to writing transactions.
As a consequence, writes may conflict with reads; when this happens, the transaction
that made the conflicting read will abort.
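The single-word layout can be illustrated with a 64-bit word held in an AtomicLong. This is a sketch under stated assumptions: the lock bit is placed in the least significant bit, and, since Java has no raw pointers, a numeric owner identifier stands in for the pointer to the transaction descriptor. The class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Versioned write lock packed into one 64-bit word.
// Bit 0 is the lock bit; bits 1..63 hold the version (unlocked)
// or an owner id standing in for the descriptor pointer (locked).
class VersionedWriteLock {
    private final AtomicLong word = new AtomicLong(0L);

    static boolean isLocked(long w) {
        return (w & 1L) != 0L;
    }

    static long payload(long w) {
        return w >>> 1;
    }

    // Invisible read: samples the word without storing anything.
    long sample() {
        return word.get();
    }

    // Returns the version that was replaced, or -1 if the lock is already held
    // or another thread changed the word first.
    long tryAcquire(long ownerId) {
        long w = word.get();
        if (isLocked(w)) {
            return -1L;
        }
        return word.compareAndSet(w, (ownerId << 1) | 1L) ? payload(w) : -1L;
    }

    // Releases the lock and publishes the new version in the same store.
    void release(long newVersion) {
        word.set(newVersion << 1);
    }
}
```

A reader only calls sample, so reading a variable never writes to the shared lock word.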
2.3.4 Recovery Strategies
The STM engine needs to record information about every update that each transac-
tion tries to make. In case a transaction commits, it may need to apply the changes
and when a transaction aborts it may need to undo the changes previously made. De-
pending on the engine implementation, it may also need to register every read on a
transaction read log to validate the transaction on commit. The transaction update log
is also called write set and the transaction read log is also called read set.
On blocking STMs there are two main recovery strategies: undo log and redo log.
Their behavior differs mainly in the way the update log is managed.
Undo log
In the undo log strategy, when a write occurs, the shared variable is locked (to prevent
other transactions from accessing the dirty data) and the shared value is replaced with
the new tentative value. The previous value is recorded on the transaction private
undo log. This strategy is also called “in-place update”.
Reads check whether the variable is locked. If it’s not, they record the lock version
number on the read set and access the variable. If it is locked, the transaction may
either wait or abort.
When a commit is performed, the engine validates the read set by checking if all
reads still have the same lock version. If it is still valid the locks are released and the
undo log is discarded, otherwise the transaction aborts.
When an abort is performed, the shared variables are re-written with the values
kept in the undo log and locks are released, therefore restoring the data to the state it
had before the transaction began.
Since these transactions change the shared variables as soon as a write is performed,
the write locks must be acquired immediately. Some authors [DS06] call this strategy
“encounter locking” mode.
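The write, commit and abort paths of the undo log strategy can be sketched as follows. This is a simplified illustration, not the code of any cited engine: the UndoCell and UndoLogTx names are assumptions, the lock is a plain owner reference without a version number, and read set validation is omitted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Shared variable; the owner field acts as a simplified, unversioned write lock.
class UndoCell {
    final AtomicReference<UndoLogTx> owner = new AtomicReference<>();
    volatile int value;
    UndoCell(int value) { this.value = value; }
}

class UndoLogTx {
    private static final class Entry {
        final UndoCell cell;
        final int oldValue;
        Entry(UndoCell cell, int oldValue) {
            this.cell = cell;
            this.oldValue = oldValue;
        }
    }

    private final Deque<Entry> undoLog = new ArrayDeque<>();

    // In-place update: lock at first encounter, save the old value, overwrite.
    // Returns false if another transaction holds the lock (caller waits or aborts).
    boolean write(UndoCell c, int newValue) {
        if (c.owner.get() != this && !c.owner.compareAndSet(null, this)) {
            return false;
        }
        undoLog.push(new Entry(c, c.value));
        c.value = newValue;
        return true;
    }

    // Commit: the new values are already in place, so only the locks are released.
    void commit() { finish(false); }

    // Abort: restore the saved values, then release the locks.
    void abort() { finish(true); }

    private void finish(boolean restore) {
        List<UndoCell> locked = new ArrayList<>();
        while (!undoLog.isEmpty()) {
            Entry e = undoLog.pop();          // most recent write first
            if (restore) {
                e.cell.value = e.oldValue;    // the oldest saved value is applied last
            }
            locked.add(e.cell);
        }
        for (UndoCell c : locked) {
            c.owner.compareAndSet(this, null);
        }
    }
}
```

Restoring in reverse order matters when the same variable is written more than once: the value saved by the first write is the one that must survive.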
Redo log
In the redo log strategy, when a write occurs, the shared variable remains unchanged
and the new tentative value is written on a transaction private write set. This strategy
is also called “out-of-place update”.
A transactional read cannot look directly at the shared variable, as it may have already
been written within the current transaction; instead, it must first look aside into the
transaction write set. If the variable is not in the write set, the read must check whether
the variable is locked. If it is not locked, the read accesses the variable directly and
records its version number in the read set; otherwise the transaction must wait or abort.
The advantage of this technique is that lock acquisitions may be delayed until the
commit phase. When a commit is performed, the shared variables are locked and, if
the read set is valid (the version numbers haven’t changed since the actual read), their
values are replaced with those stored in the write set. If it is not valid the transaction
aborts.
When an abort is performed, only clean-up work needs to be done.
Some authors [DS06] call this strategy, where the locks are acquired at commit time,
“commit time locking” mode.
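A compact sketch of the redo log strategy with commit time locking is given below. It is an illustration under stated assumptions rather than the code of any cited engine: the RedoCell and RedoLogTx names are hypothetical, each variable carries a versioned write lock encoded as (version << 1) | lockbit, and a transaction that finds a lock held at commit simply fails instead of waiting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Shared variable guarded by a versioned write lock: (version << 1) | lockbit.
class RedoCell {
    final AtomicLong lock = new AtomicLong(0L);
    volatile int value;
    RedoCell(int value) { this.value = value; }
}

class RedoLogTx {
    private final Map<RedoCell, Long> readSet = new LinkedHashMap<>();
    private final Map<RedoCell, Integer> writeSet = new LinkedHashMap<>();

    // Out-of-place update: only the private write set changes.
    void write(RedoCell c, int newValue) {
        writeSet.put(c, newValue);
    }

    // Returns null if the variable is locked or changing; the caller waits or aborts.
    Integer read(RedoCell c) {
        Integer pending = writeSet.get(c);       // look aside into the write set first
        if (pending != null) return pending;
        long before = c.lock.get();
        int v = c.value;
        long after = c.lock.get();
        if ((before & 1L) != 0L || before != after) return null;
        readSet.putIfAbsent(c, before);          // remember the version that was read
        return v;
    }

    boolean commit() {
        List<RedoCell> held = new ArrayList<>();
        for (RedoCell c : writeSet.keySet()) {   // 1. lock the write set
            long w = c.lock.get();
            if ((w & 1L) != 0L || !c.lock.compareAndSet(w, w | 1L)) {
                return fail(held);
            }
            held.add(c);
        }
        for (Map.Entry<RedoCell, Long> e : readSet.entrySet()) {  // 2. validate reads
            long seen = e.getValue();
            long expected = writeSet.containsKey(e.getKey()) ? (seen | 1L) : seen;
            if (e.getKey().lock.get() != expected) {
                return fail(held);
            }
        }
        for (RedoCell c : held) {                // 3. write back, bump version, unlock
            c.value = writeSet.get(c);
            c.lock.set((c.lock.get() & ~1L) + 2L);
        }
        clearLogs();
        return true;
    }

    // Abort: nothing was written to shared memory, so only clean-up is needed.
    void abort() { clearLogs(); }

    private boolean fail(List<RedoCell> held) {
        for (RedoCell c : held) {
            c.lock.set(c.lock.get() & ~1L);      // unlock, keeping the same version
        }
        clearLogs();
        return false;
    }

    private void clearLogs() {
        readSet.clear();
        writeSet.clear();
    }
}
```

Locks are held only during the three commit steps, and a failed commit leaves the shared variables untouched.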
Undo vs. Redo log
The undo log strategy has the main advantage of speeding up reads: they look
directly at the shared variable and do not need to look aside into the write set.
However, locks stay acquired for a longer period, which lasts from the first write to
the variable until the transaction commits.
The main advantage of the redo log strategy is that locks are acquired only when
performing the commit—while the values are being copied from the redo log to the
shared location. This minimizes the lock hold time and therefore the overall con-
tention. Reads, however, are slowed down by having to look aside to the write set
and check if a previous write by the current transaction has updated the variable.
2.3.5 Transaction States and Global Version Clock Algorithm
When using versioned write locks a number of hazards may occur due to the existence
of invisible reads. A transaction may read a dirty value—one that is being changed by
another transaction; a transaction may not see a valid memory snapshot because the
state changed between two consecutive reads; and it may see an old snapshot. When
such things happen it means the transactions have collided and at least one of them
has entered a state where it cannot commit and therefore it must undo all the changes
it made.
One example is shown in Figure 2.10, where the system has an invariant of x == y
but due to the interleaving of T2, transaction T1 does not observe the invariant on
step 8.
      T1                          T2
      // invariant x==y
  1   atomic {
  2     a=x;
      . . . . . . . . . . . . . . . . . . . . . . . .
  3                               atomic {
  4                                 x++;
  5                                 y++;
  6                               }
      . . . . . . . . . . . . . . . . . . . . . . . .
  7     b=y;
  8     // invariant x==y
        // is not observed
  9   }
Figure 2.10: Transaction sees inconsistent state.
We divide transaction state into three categories: updated consistent, obsolete consis-
tent and inconsistent.
Updated Consistent State
A transaction running in an updated consistent state is one in which all reads have
returned the latest committed values and these values haven’t been updated since the
read. Transactions that try to commit with an updated consistent memory snapshot
will succeed.
Obsolete Consistent State
A transaction is in an obsolete consistent state if it has an outdated memory snapshot.
This may happen because another transaction updates one of the variables it has read.
A transaction running in an obsolete consistent state behaves correctly, but it has a past
view of the system and therefore will fail to commit. Read-only transactions may commit
in an obsolete consistent state because they are only retrieving information from the
system.
Figure 2.11 shows an example of a transaction running into an obsolete state. At
step 6, transaction T1 is consistent but obsolete, and its commit will fail.
      T1                          T2
  1   atomic {
  2     a=x;
      . . . . . . . . . . . . . . . . . . . . . . . .
  3                               atomic {
  4                                 x++;
  5                               }
      . . . . . . . . . . . . . . . . . . . . . . . .
  6     b=z;
  7     // update something
  8   }
Figure 2.11: Transaction in consistent but obsolete state.
Inconsistent State
A transaction is in an inconsistent state if the values read do not correspond to a valid
memory snapshot. This may be due to reading dirty values or to read-after-write hazards,
e.g., two consecutive reads of the same variable returning different values because it
was updated by another transaction. Transactions in this state may behave unpredictably
because, since the system state is not consistent, any program invariant may
be broken. All inconsistent transactions will eventually abort.
Concurrency Control and Transaction States
When using read write locks (assuming a strict two phase locking model), a read is
guaranteed to always have an updated consistent state and the read variable can’t
be changed until the commit—the transaction has acquired a read lock and no other
transaction can update the value while the read lock is being held.
When using versioned write locks there is no such guarantee: transactions may
read variables while they are being written, and therefore some hazards may occur.
While the transaction is running, its loads may see dirty variables and/or inconsistent
states as a result of unfortunate interleavings. Although these transactions
will eventually abort, they may behave erroneously while running: they may
enter endless loops in the middle of the transaction, dereference invalid pointers, raise
divide by zero exceptions, fail assertions that would hold if the transaction were
running in a consistent state or, even worse, corrupt memory.
Such transactions are invalid and must be aborted; however, the abort may happen at
a much later stage in the transaction.
Figure 2.12, inspired by [HF03], shows an example of a system that has the invariant
x == y. Transaction T1 is interleaved with T2 and reads different values of x and y; as a
result, it observes an invalid state and enters an infinite loop.
      T1                          T2
      // x=0 and y=0
  1   atomic {
  2     a=x;
      . . . . . . . . . . . . . . . . . . . . . . . .
  3                               atomic {
  4                                 x++;
  5                                 y++;
  6                               }
      . . . . . . . . . . . . . . . . . . . . . . . .
  7     b=y;
  8     if (a != b)
  9       while(true){}
 10   }
Figure 2.12: Infinite loop caused by improper ordering. Transaction T1 is inconsistent.
Another potentially hazardous behavior is memory corruption, which may have effects
outside the transactional space, with results that may not be possible to undo. One
example is shown in Figure 2.13, where an application has as transactional variables: a
pointer to a dynamic size array (main_table), the array itself and an integer specifying
the array size (main_table_size). Let's suppose an STM engine for a C like programming
language, which makes writes in place and allows transactions to run in inconsistent
states. One transaction is updating one of the array elements and another is switching
the array to another one of a smaller size.
      T1                                T2
      // main_table_size = 100
  1   atomic {
  2     size = main_table_size
      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
  3                                     atomic {
  4                                       main_table_size = 10
  5                                       main_table = new_table(10)
  6                                     }
      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
  7     table = main_table
  8     offset = table + (size - 1)
  9     *offset = 1
 10   }
Figure 2.13: Memory Corruption when accessing indexed data.
The problem occurs when T1 reads the size of the old table and the pointer to the
new table (with a smaller size). When the write occurs on line 9, the table offset (offset)
is beyond the array bounds. Therefore, if the write is not prevented by the STM
engine, it will change memory outside the array. It may corrupt memory allocated for
other transactions, or even for non-transactional use. Even though the old values will
be restored when the transaction aborts, the value has already flickered, which may
cause problems if some other thread is reading the changed value. One way to avoid the
problem is to revalidate the read set before every write, but the performance cost may
be too high.
The problem described in Figure 2.13 does not happen with the redo log strategy
because the shared variables are only changed after the read set is validated. Also, in
programming languages like Java, this error would throw an exception that could be
caught by the STM engine.
The transactional engine may detect all the mentioned hazards and prevent the
user code from running in an inconsistent state by revalidating the whole read set at
every transactional load. However, the size of the read set increases with the number of
loaded variables, and revalidating the read set at every transactional load becomes
expensive for longer transactions. The cost is O(n^2), where n is the number of
transactional reads. This is why several systems opt to revalidate the read set
periodically or only at commit time.
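The quadratic bound can be made explicit with a short count. Assuming that the i-th transactional load revalidates the i - 1 entries already in the read set, and that each check takes constant time, the total number of checks for n loads is:

```latex
\[
  \sum_{i=1}^{n} (i-1) \;=\; \frac{n(n-1)}{2} \;=\; O(n^2)
\]
```

For example, under these assumptions a transaction with 1000 loads performs 499500 version checks, against 1000 if each load checked only the variable being read.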
TL2 [DON06] uses a different algorithm to avoid this problem. Instead of using a
version counter per variable, it uses a global version clock algorithm, with the clock
being incremented for every commit. When a transaction starts, it reads the global
version clock into thread local storage; this value is the transaction timestamp.
Transactional loads check whether the lock version of the variable is smaller than or
equal to the transaction timestamp. If so, the transaction proceeds; otherwise it aborts
immediately.
The global version clock algorithm detects when the read set is not consistent and
prevents user code from running in an inconsistent state. This approach is more efficient
than revalidating the read set at every transactional load, as it requires only the lock of
the loaded variable to be checked. The cost is O(n), where n is the number of
transactional reads.
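The per-load check can be sketched as follows. This is a simplified illustration and not the TL2 source code: the GvcCell, GvcTx and GvcAbort names are assumptions, the lock word is encoded as (version << 1) | lockbit, and commitWrite is a single-variable update that stands in for a full commit with read set validation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Shared variable with a versioned write lock: (version << 1) | lockbit.
class GvcCell {
    final AtomicLong lock = new AtomicLong(0L);
    volatile int value;
    GvcCell(int value) { this.value = value; }
}

// Signals that the transaction must be rolled back and restarted.
class GvcAbort extends RuntimeException {
}

class GvcTx {
    static final AtomicLong GLOBAL_CLOCK = new AtomicLong(0L);

    // Transaction timestamp: the global clock sampled once, at start.
    final long timestamp = GLOBAL_CLOCK.get();

    // Constant-time check per load: only the lock of the loaded variable is inspected.
    int load(GvcCell c) {
        long before = c.lock.get();
        int v = c.value;
        long after = c.lock.get();
        boolean locked = (before & 1L) != 0L;
        if (locked || before != after || (before >>> 1) > timestamp) {
            throw new GvcAbort();   // the variable changed after this transaction started
        }
        return v;
    }

    // Simplified single-variable commit: lock, advance the clock, write, publish version.
    static void commitWrite(GvcCell c, int newValue) {
        long w;
        do {
            w = c.lock.get();
        } while ((w & 1L) != 0L || !c.lock.compareAndSet(w, w | 1L));
        long writeVersion = GLOBAL_CLOCK.incrementAndGet();
        c.value = newValue;
        c.lock.set(writeVersion << 1);   // new version, lock bit cleared
    }
}
```

With the interleaving of Figure 2.12, the load of y by T1 finds a version newer than the timestamp of T1 and aborts before user code can compare the two values.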
The validation made by the global version clock algorithm is weaker than validating
the entire read set. Validating the entire read set detects transactions running in
inconsistent and obsolete states, while the global version clock algorithm only detects
inconsistent states (which is not a problem, since obsolete states do not cause the
transaction to behave erratically). Obsolete states are detected at commit time. This
algorithm also has more false positive aborts: situations where the transaction should not
abort but does. An example is shown in Figure 2.14 where, at step 5, transaction T1
detects a conflict when reading variable x and aborts.
      T1                  T2
  1   atomic {
      . . . . . . . . . . . . . . . .
  2                       atomic {
  3                         x++;
  4                       }
      . . . . . . . . . . . . . . . .
  5     a=x;
  6   }
Figure 2.14: False positive with the global version clock algorithm.
2.3.6 Lock Placement
There are two main strategies for lock placement on lock based STMs. The first is to
place the lock next to the data. The second is to place the lock on a separate table.
[Figure: three linked list nodes, each with a Header, Key, Value and Next field]
Figure 2.15: Sample list implementation using adjacent locks
Lock adjacent to the data
Placing the lock next to the data makes it more likely that the lock and the data
stay on the same cache line. This may yield better performance by reducing the number
of cache misses.
However, placing the lock and data next to each other requires changes to the structure
of the objects on the heap (example in Figure 2.15). The object handling is also
different. Without compiler support, the developer holds a pointer to the header and,
to access the data itself, the pointer is incremented by the header size. These changes
limit the possibility of reusing existing libraries, as they change both the structure
of the objects and the way they are handled. Some STM engines [HPST06] have used
compiler support to automatically add an extra header.
Placing the lock next to the data also brings a difficulty when using non garbage
collected programming languages. If the object is deleted and freed, the header is also
deleted. If other threads are still referencing the freed object, they may end up reading
or, even worse, writing to freed memory. Even though the transaction will abort and the
old values will be restored, writes to freed memory must be prevented by the STM engine.
On non garbage collected languages, STMs must use either a specialized malloc
and/or free implementation [Enn06, SATH+06] or a memory system that does not
allow memory still referenced by any transaction to be reused for non transactional
use [DS06, DON06]. On garbage collected languages this is not a problem, since the
runtime system only releases the object’s memory when there are no more references
to it.
Lock in a separate table
Placing the locks on a separate table does not require changes to the heap objects, nor
to the way they are handled. With this technique it is easier to re-use existing libraries,
as it only requires the memory accesses (reads/writes) to be instrumented.
To map a variable to the lock in the lock table, it is necessary to have a hash function
that maps the variable address to a position in the table.
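A minimal sketch of such a mapping is shown below. The table size, the two-bit shift
(which assumes word-aligned variables) and the names lock_index and lock_for are
assumptions made for illustration; they are not taken from any particular STM engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 1024u   /* assumed size; must be a power of two */
static_assert((LOCK_TABLE_SIZE & (LOCK_TABLE_SIZE - 1)) == 0,
              "table size must be a power of two");

static uintptr_t lock_table[LOCK_TABLE_SIZE];   /* one versioned lock per slot */

/* Map a variable address to a slot index: drop the two low bits (always
 * zero for 4-byte aligned words) and keep the low bits of the result. */
size_t lock_index(const void *addr) {
    return (size_t)(((uintptr_t)addr >> 2) & (LOCK_TABLE_SIZE - 1));
}

/* Return the lock that guards the given address. */
uintptr_t *lock_for(const void *addr) {
    return &lock_table[lock_index(addr)];
}
```

With this hash, two addresses that are 4 * LOCK_TABLE_SIZE bytes apart share the same
slot, so a finite table necessarily maps several variables to one lock.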
One practical problem with the lock table approach is the fact that several memory
addresses may be mapped to the same table position (lock collision). While not being
a correctness problem (provided that the STM engine is prepared for this matter), it
may result in a performance degradation. If there are collisions on the lock table, some
variables will be seen as locked, but in fact the lock was made for another variable
whose address maps to the same position in the lock table. As a result, there may be a
performance impact due to an increased number of unnecessary aborts.
2.3.7 Contention Management
Obstruction free and lock based STMs need an out-of-band mechanism to resolve
transaction collisions and to avoid livelocks and deadlocks (for lock based STMs).
In some systems, the responsibility for guaranteeing progress is left to a contention
manager [HLMWNS03], which decides which transaction should abort when a conflict
occurs. When a transaction tries to update a transactional variable and detects a
conflict with another transaction, it informs the contention manager about the pending
conflict. The contention manager uses heuristics to implement conflict resolution
policies. Simple heuristics include: always abort the conflicting transaction; always
abort the requesting transaction; or back off and retry a few times before aborting the
conflicting transaction. More complex contention management heuristics are described
in [WNSS05].
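The three simple heuristics listed above can be sketched as a single decision function.
The enum names, the function resolve_conflict and the retry limit are hypothetical and
chosen only for illustration; real contention managers such as those in [WNSS05] keep
more state.

```c
#include <assert.h>

typedef enum { ABORT_OTHER, ABORT_SELF, BACKOFF_THEN_ABORT_OTHER } Policy;
typedef enum { DO_ABORT_OTHER, DO_ABORT_SELF, DO_WAIT_AND_RETRY } Action;

#define MAX_BACKOFF_RETRIES 3   /* assumed limit, chosen for illustration */

/* Decide what the requesting transaction does when it finds a conflict.
 * `retries` is how many times it has already backed off on this conflict. */
Action resolve_conflict(Policy policy, int retries) {
    assert(retries >= 0);
    switch (policy) {
    case ABORT_OTHER:
        return DO_ABORT_OTHER;
    case ABORT_SELF:
        return DO_ABORT_SELF;
    case BACKOFF_THEN_ABORT_OTHER:
        return retries < MAX_BACKOFF_RETRIES ? DO_WAIT_AND_RETRY
                                             : DO_ABORT_OTHER;
    }
    return DO_ABORT_SELF;   /* not reached for valid policies */
}
```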
2.3.8 Support for Long Running Transactions
Some types of applications require large transactions, for instance a transaction
that traverses a large transactional list to count its elements. These large
transactions have a higher probability of colliding with other transactions. In the
list example, if a node is added or deleted while the list traversal is in progress,
a collision may occur and one of the colliding transactions must either i) wait
or ii) abort. Both strategies (waiting or aborting) have severe drawbacks when dealing
with large transactions: waiting for a large transaction to finish may slow down the
system, or even halt it, because every other transaction is waiting for the large
transaction to finish; and aborting may lead to a livelock on the large transaction,
as it is never able to finish successfully.
The STM engine of [CRS06] uses a technique that addresses the problem of long running
read-only transactions. This technique, which is common in database systems, records
the history of changes to the transactional values and allows read-only transactions
to succeed even if there are writing transactions running in parallel. With the
history of changes, transactions can look at the old values and can therefore consult
the memory snapshot as it was when the transaction started.
2.4 A Bit of History
Software transactional memory has come a long way since its introduction by Shavit
and Touitou in 1995 [ST95]. Many new ideas have been proposed and many have
been abandoned. We now describe some of the most significant events in the history
of software transactional memory.
The initial STM implementation of Shavit and Touitou was lock-free, which provided
resilience to thread delays and failures and avoided deadlocks and livelocks. Later,
in 2003, the obstruction freedom property was introduced to transactional memory by
Herlihy et al. [HLMWNS03], which made STM implementations simpler and more efficient.
In their paper, they also created the first STM engine that allowed the dynamic
creation and deletion of objects within a transaction, whereas prior implementations
required the variables and transactions to be statically defined in advance.
Another interesting advance was made in 2003 by Harris and Fraser [HF03], who
proposed, among other things, the first word based STM engine that did not require
additional space to be allocated for each variable and was therefore more convenient
to use.
Nowadays most STM implementations use lock based algorithms, which are resilient to
neither thread failures nor delays but perform significantly better
[Enn06, DS06, DON06]. Figure 2.16, taken from [NSS], illustrates this trend in the
design of STM engines.
Since the appearance of lock based STM engines, there has been a broad range of
ideas regarding design options. An interesting study by Saha et al. [SATH+06] in
2006 shows that versioned write locks perform much better than read-write locks.
They also pointed out that the undo log had lower overhead than the redo log
strategy because it avoids look-asides.
Later, in 2006, Dice et al. introduced an STM engine that used a redo log strategy,
with the difference that locks were acquired only at commit time. This solution
performed better than the undo log strategy under higher contention. There is
no consensus on which strategy is better; each one performs better in specific
scenarios.
In 2005 Harris et al. [HMPJH05] proposed a new set of features (conditional waiting
and composing alternatives) for transactional memory. These new features have proved
to extend the range of problems that transactional memory can address.
Figure 2.16: STM History
Nowadays, much of the work is still directed towards optimizing the performance of
STM engines. Harris et al., in 2006, proposed several new optimization strategies
in [HPST06]. The integration of STM engines into the compiler has also been suggested
as a way to increase their performance. Both Harris et al. and Adl-Tabatabai et al.
describe it in [HPST06, ATLM+06].
Chapter 3
Prototype Development
This chapter describes the TL2 STM engine, on which this work is based. It also
describes in detail the changes made to TL2 that gave rise to this thesis.
3.1 Introduction
Our prototype began with a port of TL2 to Linux/X86/GCC (TL2 was originally de-
veloped for Solaris/SPARC/SUNPRO C compiler). Afterwards, several modifications
were made to implement new features, aiming at achieving better performance than
the original TL2, increasing its safety and facilitating the debugging of the engine.
Some of the added features were: user called aborts, automatic transaction retry and
transaction nesting. Performance modifications were related to log management
strategies, transactional granularity, partial state validation and algorithmic
changes to optimize the load operation. The test related modification was the creation
of a new event tracing engine.
In the next section we describe the features and implementation options of the
original version of TL2. The following sections describe the modifications made to TL2.
3.2 Original TL2
TL2 was chosen among other STM implementations because it has two interesting fea-
tures: i) it doesn’t require the use of specialized malloc/free implementations (even for
non garbage collected languages), allowing memory to be recyclable between transac-
tional and non-transactional space; ii) by using the global version clock algorithm it
prevents transactions from running in inconsistent states.
The original version of TL2 was word based; used a redo log logging strategy; and,
depending on compile time options, it allowed locks to be placed on a separate table
or adjacent to the data.
3.2.1 Lock Structure and Placement
The default working mode of TL2 places locks on a separate table. The structure of the
lock table in TL2 is shown in Figure 3.1. The addresses of the transactional variables
are hashed to get the lock position in the lock table. TL2 also allows the locks to be
placed adjacently to the object.
The lock itself has two fields: the status (acquired or not), which is held in the
least significant bit, and the remaining bits. If the status bit is 0, the lock is
released and the remaining bits hold the version number of the variable. If the bit
is 1, the lock is acquired and the remaining bits hold the address of the write set
entry of the transaction holding the lock. Having one bit less to hold the pointer is
not a problem, since the addresses of the write set entries are at least four-byte
aligned and therefore their last two bits are unused.
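The encoding described above can be sketched as follows. The helper names are
hypothetical, and storing the version shifted left by one bit is an assumption of this
sketch; the text only states that the version occupies the bits other than the lock bit.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t LockWord;
#define LOCK_BIT ((LockWord)1)

/* Released lock: version number in the upper bits, lock bit 0. */
LockWord make_version(uintptr_t version) {
    return version << 1;
}

/* Acquired lock: address of the write set entry with the lock bit set.
 * The entry must be aligned so that its lowest address bit is zero. */
LockWord make_owned(const void *write_set_entry) {
    assert(((uintptr_t)write_set_entry & LOCK_BIT) == 0);
    return (LockWord)(uintptr_t)write_set_entry | LOCK_BIT;
}

int is_locked(LockWord w) {
    return (int)(w & LOCK_BIT);
}

/* Only meaningful while the lock is released. */
uintptr_t get_version(LockWord w) {
    assert(!is_locked(w));
    return w >> 1;
}

/* Only meaningful while the lock is acquired. */
void *get_owner(LockWord w) {
    assert(is_locked(w));
    return (void *)(w & ~LOCK_BIT);
}
```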
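[Figure: the address of each variable is hashed to an entry of the lock table. Each
entry holds a Version/Address field and a lock bit: with lock bit 0 the field holds the
version, with lock bit 1 it holds the address of the write descriptor.]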
3.2.2 Word Mode
This is the mode that was originally implemented in TL2; its API is shown in
Figure 3.2. The accesses are either reads or writes made to single words, and they are
all mediated by the STM engine. Word based mode can also be used to access
objects, treating object fields as words.
    TxStart(Thread *t);
    TxCommit(Thread *t);
    intptr_t TxLoad(Thread *t, intptr_t *addr);
    void TxStore(Thread *t,
                 intptr_t *addr,
                 intptr_t value);
Figure 3.2: Simplified API of TL2
3.2.3 Redo Log Strategy with Consistent State Validation
The original TL2 uses the redo log strategy with consistent state validation and
follows the algorithm below:
When a transaction starts, it reads the global version clock into the transaction
timestamp.
On a transactional load, the transaction first checks if the variable is already in the
write set, if it is, then the variable has already been written in this transaction and it
returns the value on the write set; otherwise it logs the read on the read set and returns
the value. The consistent state validation is performed by checking the lock version
before and after reading the variable.
For the read to be valid, three constraints must hold (Figure 3.3):
• The lock isn’t held, to avoid collisions with other writing transactions.
• The lock version is the same in both checks (before and after the read).
• The lock version is lower than or equal to the transaction timestamp (t->rv), to
avoid seeing inconsistent states (otherwise the variable may have changed since the
beginning of the transaction).
If all constraints hold, the read is successful and the read set of the transaction
continues to be consistent. Otherwise the transaction aborts.
    TxLoad(Thread *t, intptr_t *addr) {
        lock_version1 = GetLock(addr);            /* lock word before the read */
        value = *addr;
        lock_version2 = GetLock(addr);            /* lock word after the read */
        if (is_version(lock_version1) &&          /* lock isn't held */
            lock_version1 == lock_version2 &&     /* unchanged during the read */
            lock_version1 <= t->rv) {             /* not newer than the timestamp */
            ...
            return value;
        } else {
            abort();
        }
    }
Figure 3.3: Consistent state validation in Redo Log/Word based mode
On a transactional store, the transaction simply logs the write on the write set.
When a transaction commits, the variables in the write set are locked, the read set is
validated, the global version clock is incremented, the new variable values are copied
from the redo log to their positions and finally the locks of the written variables are
released with an updated version corresponding to the new global version clock num-
ber.
An abort simply discards the redo log.
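The steps of this section can be put together in a small single-threaded model. This is
only a sketch under strong simplifications: it keeps one version number per lock, omits
lock acquisition and all atomic operations, and signals an abort through a flag. All
names (Tx, tx_start, tx_load, tx_store, tx_commit) and sizes are hypothetical and do not
reproduce the TL2 source code.

```c
#include <assert.h>
#include <stdint.h>

#define N_LOCKS 64   /* assumed lock table size */
#define MAX_LOG 16   /* assumed log capacity */

static uintptr_t global_clock = 0;        /* global version clock */
static uintptr_t lock_version[N_LOCKS];   /* version per lock; locking is omitted */

typedef struct { intptr_t *addr; intptr_t value; } WriteEntry;
typedef struct {
    uintptr_t rv;                               /* transaction timestamp */
    intptr_t *reads[MAX_LOG];   int n_reads;    /* read set */
    WriteEntry writes[MAX_LOG]; int n_writes;   /* redo log */
    int aborted;
} Tx;

static uintptr_t *lock_of(const intptr_t *addr) {
    return &lock_version[((uintptr_t)addr / sizeof(intptr_t)) % N_LOCKS];
}

void tx_start(Tx *t) {
    t->rv = global_clock;
    t->n_reads = t->n_writes = 0;
    t->aborted = 0;
}

intptr_t tx_load(Tx *t, intptr_t *addr) {
    for (int i = t->n_writes - 1; i >= 0; i--)   /* already written here? */
        if (t->writes[i].addr == addr)
            return t->writes[i].value;
    if (*lock_of(addr) > t->rv) {                /* changed since the start */
        t->aborted = 1;
        return 0;
    }
    assert(t->n_reads < MAX_LOG);
    t->reads[t->n_reads++] = addr;               /* log the read */
    return *addr;
}

void tx_store(Tx *t, intptr_t *addr, intptr_t value) {
    assert(t->n_writes < MAX_LOG);
    t->writes[t->n_writes++] = (WriteEntry){ addr, value };   /* log only */
}

/* Returns 1 on success and 0 if the transaction had to abort. */
int tx_commit(Tx *t) {
    if (t->aborted)
        return 0;
    /* A real engine acquires the locks of the write set at this point. */
    for (int i = 0; i < t->n_reads; i++)         /* validate the read set */
        if (*lock_of(t->reads[i]) > t->rv)
            return 0;
    uintptr_t wv = ++global_clock;               /* new version number */
    for (int i = 0; i < t->n_writes; i++) {
        *t->writes[i].addr = t->writes[i].value; /* copy from the redo log */
        *lock_of(t->writes[i].addr) = wv;        /* release with new version */
    }
    return 1;
}
```

In this model a store is invisible to memory until the commit, and a transaction that
started before the commit aborts when it later loads the updated variable.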
3.2.4 Contention Management
Since TL2 is a blocking, lock based STM, it has to deal with deadlock and livelock
problems. In TL2, when a transaction finds a locked variable, it aborts and retries.
The retry is delayed for a random amount of time that grows exponentially with the
number of retries.
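A randomized exponential back-off of this kind can be sketched as follows. The base
delay, the cap and the function names are assumptions made for illustration; the actual
constants used by TL2 are not given in the text.

```c
#include <assert.h>
#include <stdlib.h>

#define BACKOFF_BASE_US   1u    /* assumed base delay, in microseconds */
#define BACKOFF_MAX_SHIFT 16u   /* assumed cap on the doubling */

/* Upper bound of the delay: doubles with every retry, up to a cap. */
unsigned backoff_limit_us(unsigned retries) {
    unsigned shift = retries < BACKOFF_MAX_SHIFT ? retries : BACKOFF_MAX_SHIFT;
    return BACKOFF_BASE_US << shift;
}

/* Random delay in [0, limit). The caller would sleep or spin for this
 * long before restarting the transaction. */
unsigned backoff_delay_us(unsigned retries) {
    unsigned limit = backoff_limit_us(retries);
    assert(limit > 0);
    return (unsigned)rand() % limit;
}
```

The randomization makes it unlikely that two colliding transactions retry at the same
moment and collide again.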
3.3 TL2 Port to X86
The original TL2 was developed for Solaris on SPARC/64 bit with the SUNPRO C compiler.
We had to port it, since the hardware and software we had available was Linux on
X86/32 bit with the GCC compiler. The port to this architecture required several
modifications, described below.
3.3.1 Solaris to Linux
TL2 uses the Solaris scheduler control mechanism schedctl [sch, DS07], which allows
threads to ask the kernel not to preempt them for a brief period. TL2 uses
this mechanism while locks are held, to try to avoid the situation where a
thread loses the CPU while holding locks, forcing other threads to abort or spin for
the lock.
In the port to Linux, the TL2 scheduler control was removed, since this mechanism is
not available on Linux.
3.3.2 SPARC to X86
TL2, like several other STM implementations, uses native Compare And Swap
(CAS) instructions to implement the locking mechanism. Since SPARC has a different
instruction set than X86 and since these instructions aren’t available as portable
libraries in C, they had to be ported using GCC inline assembly for X86 [int].
The implementation was inspired by the Ennals STM engine [Enn06].
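The following sketch shows how a versioned lock can be acquired with a CAS. It is not
the inline assembly used in the port: it substitutes the C11 <stdatomic.h> interface,
which compilers typically lower to a native CAS instruction on X86. The type VLock and
the functions vlock_try_acquire and vlock_release are hypothetical names, and the lock
word layout (version shifted left by one, low bit set when held) is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef _Atomic uintptr_t VLock;   /* versioned lock: low bit set = held */

/* Try to acquire the lock. Succeeds only if the lock word still equals
 * the released value `seen` observed earlier; on success the word becomes
 * `owner | 1`. Returns 1 on success and 0 otherwise. */
int vlock_try_acquire(VLock *lock, uintptr_t seen, uintptr_t owner) {
    assert((seen & 1u) == 0 && (owner & 1u) == 0);
    uintptr_t expected = seen;
    return atomic_compare_exchange_strong(lock, &expected, owner | 1u);
}

/* Release the lock, publishing a new version with the low bit clear. */
void vlock_release(VLock *lock, uintptr_t new_version) {
    atomic_store(lock, new_version << 1);
}
```

A second acquisition attempt with the same `seen` value fails while the lock is held,
because the lock word no longer matches the expected released value.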
TL2 uses a SPARC non-faulting load instruction [WG], which allows a thread to
read a memory location like a regular load, except that it does not raise an
exception (e.g., segmentation fault) when the memory location is not in the process
address space; instead the instruction returns zero. The original TL2 used this
instruction to read the lock values and to load transactional variables, because
transactions may try to read memory locations already released by other transactions.
An example is a thread trying to read a list node while another is trying to release
the very same node. It may happen that the first transaction gets a pointer to the
list node and, before it reads the node, the second transaction removes the node from
the list, commits and frees the node. When the first transaction tries to read the
node, it has already been released and the read could result in an exception.
Since the non-faulting load instruction is not available on X86, the solution was to
implement a fault handler with the setjmp/longjmp function pair. At the beginning of
the transaction, setjmp saves the processor context into a memory buffer; if an
exception (e.g., segmentation fault) is caught, longjmp restores the processor context
and execution continues as if setjmp had just returned. In other words, the
transaction is restarted.
The use of setjmp/longjmp adds a limitation to the prototype. The setjmp function
saves the caller context in a buffer, which includes the stack pointer but not the
stack contents. Therefore longjmp must be called from the same stack frame as setjmp
or from a deeper one; otherwise longjmp may find an invalid stack frame. The
consequence is that the function starting the transaction can only return after the
transaction commits or aborts.
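A minimal sketch of such a fault handler on a POSIX system is shown below. It uses
sigsetjmp/siglongjmp, the signal-aware variants of setjmp/longjmp, which is an
assumption of this sketch and not necessarily what the prototype does. The fault is
simulated with raise(SIGSEGV) instead of a real access to freed memory, and all function
names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf tx_start_ctx;                 /* context saved at transaction start */
static volatile sig_atomic_t fault_count = 0;   /* faults seen so far */

/* Fault handler: restart the transaction instead of terminating. */
static void on_fault(int sig) {
    (void)sig;
    fault_count++;
    siglongjmp(tx_start_ctx, 1);
}

/* Install the handler for segmentation faults; returns 0 on success. */
int install_fault_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Runs a "transaction" that faults `faults_to_simulate` times before it
 * succeeds, and returns the number of attempts that were needed. */
int run_transaction(int faults_to_simulate) {
    volatile int attempts = 0;     /* volatile: survives the jump back */
    assert(faults_to_simulate >= 0);
    sigsetjmp(tx_start_ctx, 1);    /* restart point; also saves the signal mask */
    attempts++;
    if (attempts <= faults_to_simulate)
        raise(SIGSEGV);            /* stand-in for a load from freed memory */
    return attempts;
}
```

Note that run_transaction is still active when the handler jumps back, which respects
the limitation described above: the frame that called sigsetjmp must not have returned.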
3.3.3 64 bit to 32 bit architecture
TL2 was implemented for a 64 bit architecture. Porting it to a 32 bit architecture was
not trivial because of the use of versioned write locks. Since, on a 32 bit architecture,
the version numbers are only 31 bits long (1 bit is reserved to indicate whether the
lock is held or not) and they are incremented at every commit, the roll-over time for
the version counter is drastically reduced. On a 1 CPU/1 GHz machine, assuming each
transaction takes 10,000 cycles to complete, the roll-over time is ≈ 6 hours.
One option to solve this problem would be to handle the overflow by resetting the
counter when there are no active transactions (eventually forcing every transaction
to abort); another is to use a wider version lock so that the problem is avoided, at
least as a practical concern. The option chosen was to use 64 bit version locks, by
using the X86 SSE instruction set. SSE instructions can atomically load, store, and
CAS 64 bit words, and the roll-over time is increased from ≈ 6 hours to ≈ 3 million
years on the same 1 CPU/1 GHz machine.
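The roll-over estimates can be reproduced with a few lines of arithmetic. The sketch
below assumes, as in the text, one version increment per committed transaction and a
single CPU; the function name is illustrative.

```c
#include <assert.h>

/* Seconds until a version counter with `bits` usable bits wraps around,
 * assuming one increment per committed transaction. */
double rollover_seconds(int bits, double cycles_per_tx, double clock_hz) {
    assert(bits > 0 && cycles_per_tx > 0 && clock_hz > 0);
    double commits = 1.0;
    for (int i = 0; i < bits; i++)
        commits *= 2.0;                          /* 2^bits commits */
    return commits * cycles_per_tx / clock_hz;   /* total cycles / frequency */
}
```

With 31 bits, 10,000 cycles per transaction and a 1 GHz clock, this gives about 21,475
seconds, or roughly 5.97 hours. With 63 bits it gives about 9.2e13 seconds, or roughly
2.9 million years.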
3.4 New Features
We now describe the features we introduced into the prototype.
3.4.1 User Aborts
Sometimes it is convenient for the programmer to explicitly abort a transaction and
undo its effects. An example (Figure 3.4) is a transaction that receives a message
from a queue and then discovers that the message is not “interesting”, so the thread
needs to undo every change to the queue. This functionality was
added to the prototype. The algorithm is described in Section 2.3.4.
    atomic {
        msg = Receive();
        if (!is_interesting(msg)) {
            abort();
        }
    }
Figure 3.4: User specified abort
3.4.2 Automatic Transaction Retry
This extension allows an aborting transaction to be retried automatically. If a
transaction detects a collision with another transaction, one of them must abort. The
example in Figure 3.5 shows a transaction that found a collision on step 3 and is
going to abort. The aborting transaction has several options regarding the control
flow: i) it may continue its control flow inside the transaction, ignoring any further
transactional writes (continues to step 4); ii) it may abort, with the control leaving
the transactional block (goes to step 6); or iii) it may automatically retry, placing
the control flow at the start of the transaction (restarts on step 2).
    1  // ...
    2  TxStart();
    3  a = TxLoad(x);   // collision is detected
    4  b = TxLoad(y);
    5  TxCommit();
    6  // ...
Figure 3.5: Control flow after abort
Options 1 and 2 have the semantics of at most one successful commit, and the problem is
that programmers must be aware that transactions may not finish successfully, so the
commit return status must always be tested. Developers must place the transaction
inside a while loop to guarantee a successful commit (Figure 3.6).
    // ...
    do {
        TxStart();
        a = TxLoad(x);
        b = TxLoad(y);
        status = TxCommit();
    } while (!status);
    // ...
Figure 3.6: Non retrying transaction placed inside a while loop
Option 3 (automatic retry) has the semantics of exactly one successful commit and is
safe to use without placing the transaction code inside a loop. There are two options
for implementing the automatic retry mechanism: using the setjmp/longjmp functions if
they are available, calling setjmp at the beginning of the transaction and longjmp if
the transaction aborts; or placing the transaction code within a function/method
[HLM06], so that the transaction is started via a call to the transaction manager,
which in turn automatically calls the transactional code inside a loop.
An example is shown in Figure 3.7. The original TL2 implemented option 1 (control
flow continues inside the transaction). Our prototype implements option 3
(automatic retry) using the setjmp/longjmp functions.
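The setjmp/longjmp variant of automatic retry can be sketched as follows. Collisions
are simulated by a counter, and the names tx_restart_point, tx_abort_and_retry and
run_until_commit are hypothetical; this is not the code of the prototype.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf tx_restart_point;   /* context saved at transaction start */

/* Abort the running transaction and jump back to its start. */
static void tx_abort_and_retry(void) {
    longjmp(tx_restart_point, 1);
}

/* Runs a "transaction" that collides `collisions` times before it
 * succeeds, and returns the number of attempts that were needed. */
int run_until_commit(int collisions) {
    volatile int attempts = 0;    /* volatile: modified between setjmp and longjmp */
    assert(collisions >= 0);
    setjmp(tx_restart_point);     /* every retry resumes here */
    attempts++;
    if (attempts <= collisions)
        tx_abort_and_retry();     /* simulated collision */
    return attempts;              /* simulated successful commit */
}
```

The caller sees exactly one return from run_until_commit, regardless of how many
attempts were needed, which is the behaviour that automatic retry aims for.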
    int TransactionXPTO() {
        a = TxLoad(x);
        b = TxLoad(y);
        return 1;
    }

    // ...
    CallTransaction(TransactionXPTO);
Figure 3.7: Transaction code placed inside a function
3.4.3 Transaction Nesting
We have added to our prototype support for transaction nesting. Our implementation
is partially based on closed nesting.
When a transaction starts, it creates a new transaction descriptor and all load and
store operations are recorded in the transaction log of that descriptor. When a transac-
tion commits, the transaction log is concatenated with the parent transaction.
When a user explicitly aborts a sub-transaction, the changes made by the sub-
transaction are undone and the sub-transaction log is discarded. In other words, only
the effects of the sub-transaction are undone. However, when a sub-transaction finds
a collision with other transactions the whole transaction is rolled back (including the
main transaction) and retried because the variable that caused the collision may have
also been read by the main transaction. One way to avoid this, would have been to
validate the entire read-set of all transactions when a sub-transaction aborts.
The use of large nested transactions with high collision rates suffers from the same
problem as non nested transactions: high abort rates.
Nesting was implemented for both logging strategies of the prototype, undo log and
redo log.
3.5 Performance Related Changes
The use of concurrent programming aims at taking advantage of the multiple cores
that computers have to improve the application throughput or decrease its processing
time. However, programming concurrent applications introduces overheads related to
the synchronization of tasks and adds contention to shared structures.
The overhead is related to the additional instructions that have to be executed to
perform the synchronization and it increases with the number of synchronization calls.
The contention is related to the simultaneous accesses to the shared structures, (either
application data structures or synchronization structures, like locks) which leads to
some of the tasks having to wait for a lock to be released or may cause additional
cache coherency traffic. Contention increases with the number of threads accessing
simultaneously the same shared structures.
The effects of the overhead of the synchronization mechanisms are mostly visible
when using a low number of CPU cores, because the overhead may be higher than the
gain of using more than one CPU core. The effects of contention on synchronization
mechanisms are mostly visible when using a high number of cores in applications
competing for the same shared structures. The probability of simultaneous accesses
to the shared structures increases, leading to higher waiting times for the locks,
more cache coherency traffic and more cache misses.
Next we describe the changes we made to the prototype focused on reducing the
overhead and minimizing the contention.
3.5.1 Undo Log Strategy with Consistent State Validation
The undo log strategy is a modification of TL2. In this mode the updates are made in
place and the locks are acquired as soon as the variable is written.
When a transaction starts, it reads the global version clock into the transaction times-
tamp.
On a transactional load, the transaction logs the read operation on the read set and
returns the value. The consistent state validation is performed by checking the lock
version before and after reading the value and doing the same constraint verifications
as in redo log/word based mode (Section 3.2.3). These checks are bypassed if the
owner of the lock is the current transaction.
On a transactional store, the transaction verifies that the lock version is lower than
or equal to the transaction timestamp (see footnote 1), acquires the lock, records the
current value in the undo log and stores the new value in place. If the variable is
already locked by this transaction, the new value is simply stored in place. If the
version check or the lock acquisition fails, the transaction immediately aborts.
When a transaction commits, the read set is validated, the global version clock is
incremented and finally the locks of the written variables are released with the lock
version set to the updated global clock version. If the transaction is read only, the
commit always returns successfully because the read set is guaranteed to be
consistent (although it may be obsolete). Committing an obsolete but consistent
read-only transaction is not a problem.
Aborts increment the global version clock, copy the variable values from the undo
log back to their original positions and release the locks with the updated global
version clock number. Aborts have to increment the lock version, at least for
read-write transactions, because the written (dirty) values may have been read by
another transaction, and the way to abort the other transaction is to update the
version, even though the value is restored to what it was when the transaction started.
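The in-place store and the abort of the undo log strategy can be sketched with a
single-threaded model. The sketch omits lock acquisition and the timestamp check, logs
every store (even repeated stores to the same variable) for simplicity, and represents
each lock by a plain version word. All names and sizes are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_UNDO 16   /* assumed undo log capacity */

static uintptr_t global_clock = 0;   /* global version clock */

typedef struct { intptr_t *addr; intptr_t old_value; uintptr_t *lock; } UndoEntry;
typedef struct { UndoEntry undo[MAX_UNDO]; int n_undo; } Tx;

/* In-place store: remember the old value, then overwrite the variable.
 * `lock` is the version word guarding `addr`. */
void tx_store(Tx *t, intptr_t *addr, uintptr_t *lock, intptr_t value) {
    assert(t->n_undo < MAX_UNDO);
    t->undo[t->n_undo++] = (UndoEntry){ addr, *addr, lock };
    *addr = value;
}

/* Abort: bump the clock, restore the old values in reverse order and
 * release each lock with the new version, so that transactions that
 * read a dirty value fail their version check. */
void tx_abort(Tx *t) {
    uintptr_t wv = ++global_clock;
    for (int i = t->n_undo - 1; i >= 0; i--) {
        *t->undo[i].addr = t->undo[i].old_value;
        *t->undo[i].lock = wv;
    }
    t->n_undo = 0;
}
```

Restoring in reverse order matters when the same variable was stored more than once:
the oldest logged value is written last and is therefore the one that remains.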
3.5.2 Object Mode
Object mode is another modification to the original TL2 and it is only implemented
for the undo log strategy (described in Section 3.5.1). The API used to handle the
objects is different from the word based mode. As in the STM engine of [HPST06], reads
and writes are made directly to the objects after the corresponding open calls. In other
words, after the object is opened, it may be accessed directly from the application
without the mediation of the transactional engine. The side effect of using this mode
(and of using optimistic read locks) is that the transaction code may run in an
inconsistent state: a transaction may read a value while it is being written by another
transaction. The way to prevent this side effect is to explicitly validate the object’s
version after accessing it.
Footnote 1 (Section 3.5.1): This is necessary to make sure that read-write variables
haven’t changed between a prior read and the write and to ensure that this transaction
is not writing to a freed variable.
The API of object mode is shown in Figure 3.8 and it works in the following way.
Before an object is read, TxOpenRead is called and it records the read operation in the
read set. After TxOpenRead is called, the object can be read directly without the
mediation of the engine. However, each read may leave the transaction in an
inconsistent state. To return to a consistent state the transaction must call
TxVerifyAddr on the object read. TxVerifyAddr checks whether the lock version of the
object is lower than or equal to the transaction timestamp. If it is, the object hasn’t
been changed since the beginning of the transaction and the transaction continues to be
consistent; otherwise a collision has occurred and the transaction immediately aborts.
The macro TxReadField simplifies usage by doing everything in one step: it reads the
value, calls TxVerifyAddr and returns the value.
Before an object is written, TxOpenWrite is called. It checks whether the lock version
is smaller than or equal to the transaction timestamp, acquires the lock and records the
current object data in the undo log. If the object is already locked by this
transaction, TxOpenWrite simply returns; if the check or the lock acquisition fails, the
transaction immediately aborts.
    TxOpenRead(Thread *t, void *addr);
    TxOpenWrite(Thread *t,
                void *addr,
                int size);
    TxVerifyAddr(Thread *t,
                 void *addr);
    #define TxReadField(t, addr, field)
Figure 3.8: Simplified API for handling objects in object based mode
When using object mode, the object fields are accessed directly without intermedi-
ation of the transactional engine. This allows the object accesses to be optimized by the
compilers. In word based mode, the read accesses are behind a function call, which
results in a performance loss due to the function call overhead and to the fact that the
memory accesses can’t be reordered by the optimizer.
In summary, the advantages over the word based mode are: only one entry in
the read set and one lock verification per object; object fields can be accessed directly
(without the transactional API); and when accessing multiple fields of an object
sequentially, a single validation is enough to verify the state consistency. The
disadvantage is the coarser lock granularity, which results in a lower potential for
concurrency.
The way we envision a future implementation of object based mode with the redo
log strategy in this prototype is for the application to hold two pointers to the object
when writing into it: one pointer to the real address of the shared object and another
to the private write buffer. The first would be the shared variable itself and would be
used as the argument to the TxOpenRead and TxOpenWrite calls. The second would be the
transaction's private write buffer and would be used to write to the object. This way we
could access the object fields directly by using the private buffer, at the cost of
having to hold two pointers.
3.5.3 Full Validation, Partial Validation and No Validation
With object based mode the use of TxVerifyAddr is not required for the transaction to
finish correctly; it is only required to guarantee that the transaction runs in
consistent memory states, e.g. to prevent infinite loops. Other STMs that do not use the
global version clock algorithm may also achieve this by revalidating the read set after
every read, but at a cost of O(n²), where n is the number of transactional reads.
This prototype does this validation with a cost of O(n).
However, even with the global version clock algorithm, validation has a non negli-
gible cost (discussed in Chapter 5) and therefore state validation should be minimized
as much as possible. One possibility to reduce this overhead is for state validation to
be made only on the places necessary to avoid unacceptable transaction behaviors like
infinite loops, failure of valid assertions, etc. For instance, when accessing multiple
fields of an object sequentially, only one validation is usually enough to verify the state
consistency.
We divide consistent state validation in three modes: full state validation, partial state
validation and no state validation. Full state validation is achieved by verifying the lock
version every time a read to a transactional variable is made and therefore the transac-
tion is always consistent. No state validation is achieved by never verifying the transac-
tion state while the transaction is running. In other words, the object is opened before
any of the fields are read but the engine never verifies if the state continues to be con-
sistent. Partial state validation is achieved by selectively choosing the places to validate
the state. This may be a feature of a compiler or a transactional access optimizer, which
identifies sequential accesses to the same object and performs validation only after all
sequential accesses are finished.
Next, we show three examples, one with full state validation, another with partial
state validation and another with no state validation. They show a part of a transaction
reading the key and value of a list node. After the read, an assertion is made to verify
the invariant that the keys must be greater than zero. The assertion must be checked
within a consistent state; otherwise, interleavings with other transactions may fail the
assertion even if it is valid according to the algorithm.
The first example (Figure 3.9) shows a transaction running with full state validation.
This approach makes the transaction always run in a consistent state—at the cost of ad-
ditional checks. The assertion can be made immediately after reading the key, because
the state is guaranteed to be consistent.
    TxOpenRead(t, node);
    key = TxReadField(t, node, key);
    assert(key > 0);
    value = TxReadField(t, node, value);
Figure 3.9: Full state validation
The second example (Figure 3.10) shows a transaction running with partial state
validation. The validation is performed after reading the key and value. This approach,
however, allows the transaction to run in an inconsistent state in steps 2 and 3. After
the revalidation in step 4, the transaction is again consistent until the next
transactional load. The assertion can only be made after the verification finishes, to
guarantee that it is issued in a consistent state.
    1 TxOpenRead(t, node);
    2 key = node->key;
    3 value = node->value;
    4 TxVerifyAddr(t, node);
    5 assert(key > 0);
Figure 3.10: Partial state validation
The third example (Figure 3.11) shows a transaction running with no state validation.
The transaction code may run in inconsistent state and therefore no assertions can be
issued (unless the entire read set is validated).
    TxOpenRead(t, node);
    key = node->key;
    value = node->value;
Figure 3.11: No state validation
The first approach (full state validation) is simpler for the programmer to use, because
there is no need to keep track of which objects have to be verified for consistency.
The second approach (partial state validation) has less overhead because fewer
validations are made. Yet, it is more complex for a programmer to use directly.
It is better suited for compilers that generate the transactional code from a higher level
language. The compiler may validate the transaction state after a sequence of
transactional reads, or before some event that needs a consistent state, e.g., before
issuing an assertion, before testing a loop condition, etc.
The third approach (no state validation) has even less overhead. However, transactions
may run in inconsistent states and nothing can be ensured while the transaction is
running. Only when a commit returns successfully is it guaranteed that the transaction
was successful and no harmful interference occurred.
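The trade-off between full and partial validation can be sketched with a toy, single-threaded model. Everything here is an assumption made for illustration: node_t, tx_t and verify are hypothetical stand-ins (verify plays the role of TxVerifyAddr), and the lock word is assumed to hold an even version number when unlocked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical object layout: a versioned lock next to the fields. */
typedef struct {
    uintptr_t version;  /* even = unlocked version, odd = locked (assumed) */
    int key;
    int value;
} node_t;

typedef struct {
    uintptr_t rv;          /* timestamp sampled at transaction start */
    unsigned validations;  /* number of validation checks performed  */
} tx_t;

/* Stand-in for TxVerifyAddr: the object is consistent for this transaction
 * if it is unlocked and was not written after the transaction started. */
static bool verify(tx_t *t, const node_t *n)
{
    t->validations++;
    return (n->version & 1u) == 0 && n->version <= t->rv;
}

/* Full validation: verify after every field read. */
static bool read_full(tx_t *t, const node_t *n, int *key, int *value)
{
    *key = n->key;
    if (!verify(t, n))
        return false;
    *value = n->value;
    return verify(t, n);
}

/* Partial validation: read both fields, then verify once. */
static bool read_partial(tx_t *t, const node_t *n, int *key, int *value)
{
    *key = n->key;
    *value = n->value;
    return verify(t, n);
}
```

In this model read_full performs two checks per node and read_partial one; the caller may only rely on key and value after the function has returned true.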
One simplification may be done when a write lock is acquired. Since transactional
writes acquire object locks, the reads don’t need to be validated on objects opened for
write (Figure 3.12).
    TxOpenWrite(t, node, sizeof(node_t));
    key = node->key;
    assert(key > 0);
    value = node->value;
Figure 3.12: Consistent state validation for locked variables
3.5.4 Reducing TxLoad Overhead
The performance of the STM engine is mostly dominated by the efficiency of the
transactional load operation. Typical applications perform far more reads than writes
[HPST06], and therefore far more loads than any of the remaining transactional calls
(stores, starts, commits, aborts, etc.). Any improvement to the load operation thus has
a much more significant impact than an improvement to other operations
(Amdahl's Law).
On the other hand, as the number of threads increases, the likelihood of threads
colliding in the access to shared data also increases. The shared data may be the data
variables themselves, locks, or any other shared structures. Reducing contention should,
therefore, focus on reducing the number of accesses to shared data or on restructuring
the access pattern to have fewer collisions.
We have designed an improvement to the TxLoad algorithm that reduces the num-
ber of steps on the transactional load and simultaneously reduces the accesses to the
shared variables.
The load operation of the original TL2 checked the lock version before and after
reading the variable (Figure 3.3). Afterwards it verified three constraints:
i) Lock isn’t held.
ii) Lock versions are the same between both checks.
iii) Lock version is not greater than the transaction timestamp (t->rv).
The first two checks verify if the variable isn’t dirty by ensuring that it hasn’t changed
between the two lock validations and that it is not being modified by another trans-
action. The third check verifies if the variable has changed since the beginning of the
transaction.
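The three constraints can be written as a single predicate. This is a minimal sketch rather than the TL2 source: the function names are hypothetical, and the lock word is assumed to be odd while the lock is held and to carry an even version number otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed lock-word encoding: odd = lock held, even = version number. */
static bool is_locked(uintptr_t lock_word)
{
    return (lock_word & 1u) != 0;
}

/* before/after: lock word sampled before and after reading the value.
 * rv: transaction timestamp sampled from the global version clock. */
static bool tl2_load_is_valid(uintptr_t before, uintptr_t after, uintptr_t rv)
{
    if (is_locked(after))
        return false;   /* i)   the lock is held                   */
    if (before != after)
        return false;   /* ii)  the version changed during the read */
    if (after > rv)
        return false;   /* iii) written after the transaction began */
    return true;
}
```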
The TxLoad operation was improved by changing the commit algorithm in a way that
allows the load operation to verify the lock version only once, while keeping all
algorithmic properties. The idea is that, if one transaction changes the variable while
another is reading it, then the lock, on the subsequent validation, will be either
acquired or have a version which is greater than the transaction timestamp. Therefore,
the lock version validation prior to the read of the value (as per the TL2 algorithm)
can be avoided (Figure 3.13). The implementation of this optimization required a
modification in the commit algorithm.
    ...
    TxLoad(Thread *t, intptr_t *addr) {
        value = *addr;
        lock_version = GetLock(addr);
        if (isversion(lock_version) &&
            lock_version <= t->rv) {
            ...
            return value;
        } else {
            abort();
        }
    }
Figure 3.13: Optimized TxLoad algorithm.
The commit algorithm had to be changed so that the global version clock is only
incremented after the values are safely written in their final locations. This avoids
the situation shown in Figure 3.14, where a dirty read is not detected when
TxCommit is interleaved with another transaction.
Note that the clock is always incremented by 2: odd numbers mean the lock is held,
even numbers mean the lock is released.
The solution to the problem is to increment the global version clock after the values
have been written and flushed to memory (Figure 3.15). This new setup requires the redo
log to be flushed to memory (via a CPU memory barrier [McK05]) to prevent the compiler
and the CPU from re-ordering the instructions.

Step  T1                      T2          Description
1     TxCommit                            Commit starts
2     Acquire Locks                       Locks are acquired
3     Increment Global Clock              Clock is incremented to 12
4                             TxStart     TX starts and timestamp is sampled on value 12
5                             TxLoad      Load starts
6     Validate ReadSet                    Read set is validated; transaction will commit
7     Apply redo log          Read value  T1 applies the redo log while T2 reads the
                                          variable value; the value read is dirty
8     Release locks                       Locks are released with the new version set to 12
9                             Read lock   T2 checks the lock version and, since it is equal
                                          to the transaction timestamp (12), considers the
                                          read to be valid, although it was dirty
Figure 3.14: Redo log mode: Reference to Non-Transactional Memory.
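The effect of moving the clock increment can be replayed with the numbers used in Figures 3.14 and 3.15. The sketch below is a sequential model written for illustration, not engine code; the function names are hypothetical and the even/odd lock-word encoding is the one assumed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single post-read check of the optimized load: the lock word must be an
 * unlocked version (even, by assumption) not greater than the timestamp. */
static bool optimized_load_is_valid(uintptr_t lock_word, uintptr_t rv)
{
    return (lock_word & 1u) == 0 && lock_word <= rv;
}

/* Replays the interleaving in which T2 starts while T1 is committing, reads
 * a value T1 is overwriting and checks the lock after T1 has released it.
 * increment_before_write selects the original (true) or the corrected
 * (false) commit order. Returns true when T2 accepts the dirty read. */
static bool dirty_read_accepted(bool increment_before_write)
{
    uintptr_t clock = increment_before_write ? 10 : 12;
    uintptr_t rv;

    if (increment_before_write)
        clock += 2;                     /* T1 increments early: 10 -> 12   */
    rv = clock;                         /* T2 starts, samples timestamp 12 */
    /* T1 applies the redo log while T2 reads the value (dirty read).      */
    if (!increment_before_write)
        clock += 2;                     /* T1 increments late: 12 -> 14    */
    uintptr_t release_version = clock;  /* T1 releases the locks           */
    return optimized_load_is_valid(release_version, rv);
}
```

With the early increment the lock is released with version 12, which passes the check against timestamp 12; with the late increment it is released with version 14 and the read is rejected.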
3.5.5 Block Size
Another experiment with the STM engine was the change of the block size. The change
is only implemented in the undo log word based mode and the goal is for the STM
engine to have fewer locks and lower overhead on acquisition and release of the locks,
at the cost of a coarser lock granularity. In the regular undo log mode, when a
transaction writes to a variable, it locks the variable and stores its previous value in
the undo log. When the block size is changed to the cache line size and a transactional
store is issued, the STM engine locks the entire block (one lock is enough for the full
block) and records the value of the full block in the undo log. The advantage is that
fewer lock/unlock operations and copies to the undo log are needed; the disadvantage is
a coarser lock granularity, with a higher probability of false contention. This technique
is also used by [SATH+06].
The block size must be coordinated with the application's memory allocation routine.
The application must allocate a size which is a multiple of the STM block size and the
memory must be aligned to the block size. Otherwise the STM engine, when copying
to the undo log, may read unallocated memory; and when aborting a transaction it
may write to memory allocated for other uses. Therefore the use of this functionality is
not safe unless the application handles this by, for example, allocating an aligned array
of objects.

Step  T1                      T2          Description
1     TxCommit                            Commit starts
2     Acquire Locks                       Locks are acquired
3                             TxStart     TX starts and timestamp is sampled on value 12
4                             TxLoad      Load starts
5     Validate ReadSet                    Read set is validated; transaction will commit
6     Apply redo log          Read value  T1 applies the redo log while T2 reads the
                                          variable value; the value read is dirty
7     Increment Global Clock              Clock is incremented to 14 (clock is always
                                          incremented by 2)
8     Release locks                       Locks are released with the new version set to 14
9                             Read lock   T2 checks the lock version (14); since it is
                                          greater than the transaction timestamp (12), the
                                          read is not valid and the transaction is aborted
Figure 3.15: Redo log mode: Reference to Non-Transactional Memory (corrected version)
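One way an application could satisfy both constraints is sketched below. It is only an illustration: BLOCK_SIZE (64 bytes) and the helper names are assumptions, and the allocation relies on the C11 function aligned_alloc.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 64u  /* assumed cache line size; must be a power of two */

/* Base address of the block that contains addr. */
static uintptr_t block_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(BLOCK_SIZE - 1);
}

/* Rounds size up to a whole number of blocks. */
static size_t round_up_to_block(size_t size)
{
    return (size + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
}

/* Allocates memory that starts and ends on a block boundary, so that
 * copying or restoring a full block never touches foreign memory. */
static void *block_alloc(size_t size)
{
    return aligned_alloc(BLOCK_SIZE, round_up_to_block(size));
}
```

Memory obtained this way is released with free as usual.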
3.5.6 Lock Adjacent to Data
With object based mode (detailed in Section 3.5.2) we extended the prototype to support
lock placement adjacent to the object, as well as in a separate table.
To place the lock next to the data it is necessary to add to the data structure a field
that will hold the lock information. Since our prototype was implemented in the C
programming language and C does not have any automatic way to add fields to
structures/objects, it was decided to manually add a lock field to every transactional
object in the application code. Although this is not suitable for production use, it is
good enough to run some experiments and evaluate the performance.
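The two placements can be contrasted in a few lines. The sketch is hypothetical throughout: stm_lock_t, the node layout, the table size and the hash are assumptions chosen for illustration and are not the prototype's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t stm_lock_t;       /* hypothetical versioned lock word */

/* Adjacent placement: the lock field is added by hand to the object. */
typedef struct node {
    stm_lock_t lock;
    int key;
    int value;
    struct node *next;
} node_t;

/* Separate table: the lock is found by hashing the object address. */
#define LOCK_TABLE_SIZE 1024u       /* assumed size, power of two */
static stm_lock_t lock_table[LOCK_TABLE_SIZE];

static stm_lock_t *lock_adjacent(node_t *n)
{
    return &n->lock;
}

static stm_lock_t *lock_in_table(const void *obj)
{
    size_t slot = ((uintptr_t)obj >> 3) & (LOCK_TABLE_SIZE - 1);
    return &lock_table[slot];
}
```

With the adjacent placement the lock shares a cache line with the fields it protects and two objects never share a lock; with the table, unrelated objects may hash to the same slot.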
3.6 Safety Improvements
Next we describe our new quiescing algorithm. Unlike the original algorithm from
TL2, this one guarantees that transactions run in consistent states even when memory
is entering and leaving the transactional space.
3.6.1 New Quiesce Algorithm
Transactional variables must only be accessed by transactions; however, it is desirable
that variables leaving the transactional space (being freed) can be re-used by non
transactional operations. To handle this transition the variables must be quiesced before
they are freed (example in Figure 3.16). In TL2 the quiesce function is named
TxSterilize.
Quiescing on the original TL2 consisted of waiting for the writes to be drained and
the lock to be released. This, however, fails under some circumstances. Consider the
situation where we have a list with two nodes (A and B) and three threads (TX1, TX2
and NT3) are using it. TX1 is looking for node B, TX2 is deleting node B, NT3 is not
running any transaction. The following sequence of events takes place:
1. TX1: starts (with transaction timestamp 10) and looks up node A, which is prior to
node B, reads the pointer to node B.
    // allocate memory
    new_memory = malloc(sizeof(int));
    TxStart(t);
    TxStore(t, &a, new_memory);
    TxCommit(t);

    // release memory
    TxStart(t);
    memory_to_free = a;
    TxStore(t, &a, NULL);
    TxCommit(t);
    TxSterilize(t, memory_to_free);
Figure 3.16: Creating and releasing a transactional variable (simplified)
2. TX2: starts, (with transaction timestamp 10) looks up node B and removes all ref-
erences to node B.
3. TX2: commits (increments the global version clock to 12).
4. TX2: quiesces node B (no thread is currently locking B).
5. TX2: frees node B.
6. NT3: starts, calls malloc and receives a pointer to the same memory where node B
was previously stored.
7. NT3: writes to that memory.
8. TX1: follows the pointer to the former node B and reads its contents. At this point
TX1 is reading memory already recycled for use by another thread. Hence TX1
is running in an inconsistent state, which violates the idea of never seeing
inconsistent memory states.
The newly designed algorithm allows memory to be recycled while transactions always
run in consistent states. The solution found was for the quiesce operation to treat the
delete as a regular write. Quiesce updates the lock version to the current value of the
global version clock and, since it is always called after the commit and before the free,
updating the node version number will invalidate any future reads of it by other
transactions. Quiesce does not need to increment the value of the global version clock
because it was already incremented when the commit was done (quiesce can't be called
inside an active transaction). With the new quiesce, the previous operations are
transformed into the following, where step 8 no longer sees the inconsistent state:
1. TX1: starts (with transaction timestamp 10) and looks up node A, which is prior to
node B, reads a pointer to node B.
2. TX2: starts (with transaction timestamp 10), looks up node B and removes all ref-
erences to node B.
3. TX2: commits (increments the global version clock to 12).
4. TX2: quiesces node B (node B's lock is set to 12).
5. TX2: frees node B.
6. NT3: starts, calls malloc and receives a pointer to the same memory where node B
was previously stored.
7. NT3: writes to that memory.
8. TX1: follows the pointer to the former node B and verifies that B's lock version (12)
is greater than its own transaction timestamp (10). TX1 aborts and the user code never
sees an inconsistent state.
Even if some other transaction incremented the clock between steps 3 and 4, the result
would be the same.
Apart from solving the above problem, the new quiesce function also allows recy-
cling of memory when using the undo log mode (described in Section 3.5.1), which
could not be done on TL1 [DS06].
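The two step sequences above can be replayed in a small sequential model. The code is an illustrative sketch under stated assumptions: the names are hypothetical, the free and re-allocation of node B are modelled by simply overwriting its key, and the reader's check is the single post-read comparison against the transaction timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uintptr_t global_clock = 10;   /* toy global version clock */

typedef struct { uintptr_t lock_version; int key; } node_t;

/* Original behaviour as described above: wait for writers to drain and
 * leave the version untouched. With no writer there is nothing to do. */
static void quiesce_original(node_t *n) { (void)n; }

/* New behaviour: treat the delete as a write by stamping the node with
 * the current value of the global version clock. */
static void quiesce_new(node_t *n) { n->lock_version = global_clock; }

/* Post-read check made by a reader with timestamp rv. */
static bool read_is_valid(const node_t *n, uintptr_t rv)
{
    return (n->lock_version & 1u) == 0 && n->lock_version <= rv;
}

/* Replays steps 1 to 8; returns true if TX1 accepts the recycled node. */
static bool tx1_accepts_recycled_node(bool use_new_quiesce)
{
    node_t b = { .lock_version = 8, .key = 42 };
    global_clock = 10;
    uintptr_t tx1_rv = global_clock;  /* step 1: TX1 starts, timestamp 10 */
    global_clock += 2;                /* step 3: TX2 commits, clock = 12  */
    if (use_new_quiesce)              /* step 4: TX2 quiesces node B      */
        quiesce_new(&b);
    else
        quiesce_original(&b);
    b.key = -1;                       /* steps 5-7: memory reused by NT3  */
    return read_is_valid(&b, tx1_rv); /* step 8: TX1 reads node B         */
}
```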
3.7 Debugging enhancements
This section describes the new tracing engine implemented to facilitate the debugging
of the STM engine.
3.7.1 Event Tracing
The debugging of the STM engine is a complex task. The amount of concurrency of
transactional applications is at the level of very fine grained locking (one lock per
variable), therefore a huge number of interleavings is allowed. In this scenario,
traditional debuggers are of little use: not only do they interfere with the run,
reducing or even eliminating concurrency, but they also don't record the interleavings
of actions. When running the transactional code in a debugger like GDB, the concurrency
problems related to certain interleavings, such as the one discussed in Section 3.6.1,
typically disappear.
To effectively observe these problems it is necessary to: i) allow the transactions
to run with minimal intrusion—measured, not only in terms of overhead on the local
thread but also on the size of the locked section—a smaller locked section means that
other threads wait a smaller amount of time for the lock to be released; ii) check pro-
gram invariants while the transactions are being run; and iii) be able to observe the
past actions of the transactions, and the order of their occurrence.
Tracing Levels
To ease the debugging of the STM engine we have developed a minimal tracing engine
that selectively records events in three different layers: the application layer, the
transactional layer, and the lock and data layer.
• The application layer information is important to know which piece of code of
the application was running. Examples of such information are: “insert element
X on the list L”.
• The transactional layer information is important to know which step of the
transaction was being run. It logs which transactional primitives (TxStart, TxCommit,
TxAbort, TxLoad, TxStore) are being run and the stage of the operation. Some
examples are: executing a transactional load of address X - checking whether variable is
locked; executing a transactional store - variable is already locked by another
transaction - going to abort.
• The lock and data layer information is the deepest layer of tracing. It is important
when the information on the upper layers is insufficient. It traces information
related to the acquisition/release of locks, and to the reads and writes of shared
variables.
Tracer Internal Structure
To store the information in order of occurrence (total order) we have used a circular
buffer of events, where each buffer item holds one event. The access to the buffer is
synchronized among all transactions with a CAS instruction on the next buffer item
variable. When a transaction needs to add an event to the trace, it first atomically
increments the buffer index and then writes the buffer element. When the event buffer
is full the next buffer item variable is reset to the first position of the buffer and the
oldest events are overwritten.
The usage of the CAS instruction, instead of a lock, was motivated by the need to
reduce the overhead and contention of the tracing facility. Using the CAS instruction,
the exclusivity time is only one CPU instruction. If transactions T1 and T2 are trying
to trace an event simultaneously and the CAS of T1 succeeds, then T2 only has to wait
until the CAS of T1 finishes. If, instead, the tracer used lock/read/modify/unlock,
then it would have at the very least four CPU instructions of exclusivity: acquire the
lock, read the pointer, write the new pointer position, release the lock. With a smaller
exclusive section, the other threads wait, on average, a smaller amount of time
for each other. However, this architecture does not prevent a high number of cache
conflicts on the next buffer item variable: if many transactions are using the tracing
facility and each transaction runs on a separate CPU, there is a high probability that
the next buffer item variable will constantly be switching between processor caches.
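A circular event buffer of this kind might look as follows. This is a sketch using C11 atomics, not the tracer's source: the capacity, the event layout and the function names are assumptions, and the write into the reserved slot is deliberately left unsynchronized, as in the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TRACE_CAPACITY 4u   /* assumed buffer size, tiny for the example */

typedef struct {
    int thread_id;
    int type;
    int step;
    intptr_t args[3];
} trace_event_t;

static trace_event_t trace_buffer[TRACE_CAPACITY];
static _Atomic unsigned next_buffer_item;   /* index of the next free slot */

/* Reserves a slot with a CAS loop, wrapping to the first position when the
 * buffer is full, and returns the reserved index. */
static unsigned trace_reserve_slot(void)
{
    unsigned current = atomic_load(&next_buffer_item);
    unsigned next;
    do {
        next = (current + 1u) % TRACE_CAPACITY;
        /* On failure, current is refreshed with the value seen in memory. */
    } while (!atomic_compare_exchange_weak(&next_buffer_item, &current, next));
    return current;
}

/* Records one event; the oldest event is overwritten once the buffer wraps. */
static unsigned trace_event(int thread_id, int type, int step,
                            intptr_t a0, intptr_t a1, intptr_t a2)
{
    unsigned slot = trace_reserve_slot();
    trace_buffer[slot] =
        (trace_event_t){ thread_id, type, step, { a0, a1, a2 } };
    return slot;
}
```

Only the index update is exclusive; a thread that loses the CAS simply retries with the refreshed index.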
The tracing facility can trace the following types of information: thread identifier,
type of event, step and three optional arguments. The thread identifier is a number that
uniquely identifies the thread. The type of event is an enumeration that identifies the
event layer and the operation being performed on that layer. Each operation is
composed of one or more steps; the step identifies the step of the operation (e.g.,
within a transactional load, step 20 identifies that the variable is unlocked). The three
arguments are numeric and they are operation/step specific, e.g., they may log the
address of the variable being loaded, the lock version, or any other information related
to the event being logged.
The tracing engine API is shown in Figure 3.17 together with an example of its
usage. The example shows a trace call made in the TxLoad operation. The thread
identifies itself with the UniqID, the type of the event is tl_tx_load, the step is number
10 and there are three arguments provided: the address of the read variable, its value
and the version of the variable lock.
    void TraceEvent(int thread_id, int step,
                    enum TraceEventName type,
                    intptr_t const volatile *addr,
                    intptr_t arg1, intptr_t arg2);

    TraceEvent(Self->UniqID, 10, tl_tx_load, addr, value, version);
Figure 3.17: API and sample usage of the Tracing Engine
Naturally, the tracer can be completely disabled for performance tests. This is done
with a preprocessor definition, which is enabled or disabled at compile time, thus re-
ducing the tracer intrusion to zero when the tracer is disabled.
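A compile-time switch of this kind is commonly written as a macro that expands to nothing when tracing is off. The sketch below is an assumption about how such a switch could look: STM_TRACE, TRACE and trace_event_stub are hypothetical names, not the prototype's.

```c
#include <assert.h>

static int trace_calls;   /* number of events that reached the tracer */

/* Stand-in for the real tracing function. */
static void trace_event_stub(int thread_id, int step)
{
    (void)thread_id;
    (void)step;
    trace_calls++;
}

/* STM_TRACE is a hypothetical switch, e.g. passed as -DSTM_TRACE. */
#ifdef STM_TRACE
#define TRACE(thread_id, step) trace_event_stub((thread_id), (step))
#else
#define TRACE(thread_id, step) ((void)0)   /* no code is generated */
#endif

/* A traced operation: with the tracer disabled the call disappears. */
static void traced_load_step(void)
{
    TRACE(1, 10);
}
```

When STM_TRACE is not defined, traced_load_step contains no call at all, so the arguments of TRACE are not even evaluated.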
Tracer Output
When a program invariant is known to be broken (assertion failure, segmentation
fault, etc.), the tracing facility is instructed to dump its contents to a file. The event
buffer can also be dumped to a file using the core file and a script for GDB which is
available along with the source code.
The trace log file is a simple text file that can be imported by a spreadsheet program
like Microsoft Excel or OpenOffice Calc and displayed in tabular format. Using the
spreadsheet program's capabilities, the events can be filtered by type or by thread.
There is still considerable room for improvement in the tracing facility. A few
interesting and useful improvements would be for the tracer to allow other types of
arguments (e.g., strings); to have automatic race condition detection (e.g., detect a
transaction that writes to an unlocked variable); to display the events and their
interleavings graphically; to create a dependency graph of the events; to have advanced
filtering capabilities; etc.
Chapter 4
Testing STM Implementations
This chapter describes some of the problems found while implementing the changes
in the STM engine and synthesizes the tests used to reproduce these problems.
4.1 Introduction
The experimental work of porting, extending and testing the TL2 STM engine caused
several complex bugs to show up. From the experience acquired in these experiments
we report some of the bugs that we have faced and we synthesize a few testing patterns
which aid in finding and reproducing these and other erroneous behaviors.
Parts of this chapter were published in [LC07].
4.2 Terminology
Before showing a sample of the problems found during the development of our ver-
sion of the STM engine, we describe the terminology used in the examples. Figure 4.1
describes the operations made by the transactions at the transactional level: TxStart,
TxLoad, TxStore, TxCommit, TxAbort and TxSterilize. Figure 4.2 describes the operations
made by the STM engine at the lock and data level: read variable’s lock version, read
variable’s value, etc.
Symbol          Meaning
TxStart()       Start transaction
TxCommit()      Commit transaction
TxAbort()       Abort transaction
TxLoad(x)       Transactional operation to read the value of variable x
TxStore(x, a)   Transactional operation to write the value a to variable x
TxSterilize(x)  Transactional operation to quiesce variable x before it is released
                (freed)
Figure 4.1: Transaction Construct Glossary (Transactional Level)
Figure 4.3 shows the transformation of transactional-level operations (shown in Fig-
ure 4.1) into lock-level ones (listed in Figure 4.2). The transformation is simplified, as
some internal operations were omitted, like adding elements to read and write sets or
internal validation and maintenance operations. The omitted operations are not rele-
vant to the illustration of the bugs reported herein.
TxStart — In both redo and undo log based STMs, a timestamp is associated
with the transaction. The timestamp is the current value of the global version
clock.
Symbol      Meaning
a = RV(x)   Read the value of transactional variable x
v = RL(x)   Read the lock version of the transactional variable x
WV(x, a)    Write the value a to the transactional variable x
Acq(x)      Acquire the lock of the transactional variable x
Rel(x)      Release the lock of the transactional variable x
IncL(x)     Increment the lock version of variable x to the current version of the
            global clock
Figure 4.2: Transaction Construct Glossary (Lock and Data Level)
TxLoad — In both redo and undo log based STMs, loading a variable (memory
location) is, basically, a three-step operation: i) load the variable's version counter;
ii) load the variable's value; iii) load the variable's version counter again. After
these steps it is checked whether the variable's version number has changed
between the first and the third step. If it did change, the transaction must abort
immediately; otherwise the address is added to the read set. These checks are not
relevant for the bugs analyzed in this thesis and were, therefore, omitted in the
decomposition of the operation illustrated in Figure 4.3 (see footnote 1).
TxStore — In redo log mode, changes are made out-of-place. The new variable’s value
is stored in the redo log, and no memory changes are actually made at this point.
The new value will only overwrite the original one when the transaction commits
(if it succeeds). If the commit fails then the redo log is simply discarded.
In undo log mode, changes are made in-place. Once the lock of the variable has
been acquired, the current variable’s value will be stored in the undo log and
then overwritten with the new value.
TxCommit — In redo log mode, all the pending updates should now become effective.
In this case, commit is a four-step operation: i) acquire locks for all variables that
will be updated; ii) validate the read-set, which ensures that all variables read by
the transaction satisfy two conditions: they are not currently locked by any other
transaction, and they were not changed since the moment they were first read until
all the write locks have been acquired; iii) apply all pending changes to memory
locations (overwrite the memory locations with the values kept in the redo log);
and iv) release all acquired locks, changing the version number to the incremented
version of the global clock.
In undo log mode, the locks of the overwritten variables were already acquired
in the TxStore operation and the new values were already written into the
variables. It is only necessary to validate the read-set and, if successful, release all
acquired locks, changing the version number to the incremented version of the
global version clock.

Operation       Decomposition for a        Decomposition for an
                redo log mode              undo log mode
TxStart()       ts = clock;                ts = clock;
TxCommit()      Acq(all write-set);        RL(all read-set);
                RL(all read-set);          Rel(all write-set);
                WV(all write-set, new);
                Rel(all write-set);
TxAbort()       -                          WV(all write-set, old);
                                           Rel(all write-set);
a = TxLoad(y)   v1 = RL(y);                v1 = RL(y);
                a = RV(y);                 a = RV(y);
                v2 = RL(y);                v2 = RL(y);
TxStore(x, a)   -                          Acq(x);
                                           WV(x, a);
TxSterilize(x)  IncL(x);                   IncL(x);
Figure 4.3: Simplified decomposition of transactional- into lock-level operations in
undo- and redo log mode STMs

Footnote 1: This TxLoad algorithm refers to the version used before the performance
improvement described in Section 3.5.4 was applied.
TxAbort — In redo log mode, abort simply discards the redo log, so it is a null
operation as far as changes to shared memory locations are concerned.
In undo log mode, new values were already written in-place. So, when aborting
a transaction, the original values must be restored from the undo log, and the
previously acquired locks must be released.
TxSterilize — This function is called before releasing any transactional variable. Both
in undo and redo log mode, it prevents all transactions from doing any further
reads or writes to that variable. It does so by increasing the lock version of the
variable to the current global version clock number.
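The difference between the two logging disciplines in TxStore, TxCommit and TxAbort can be seen in a toy model. The sketch is single-threaded and tracks a single variable; locking and read-set validation are left out on purpose, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-variable log; only the logging discipline is modelled. */
typedef struct {
    int *addr;      /* variable written by the transaction            */
    int  logged;    /* redo: new value; undo: previous value          */
    bool active;    /* true once the transaction has written to addr  */
} tx_log_t;

/* Redo log: the store is buffered, memory is untouched until commit. */
static void redo_store(tx_log_t *t, int *addr, int v)
{
    t->addr = addr;
    t->logged = v;
    t->active = true;
}
static void redo_commit(tx_log_t *t)
{
    if (t->active)
        *t->addr = t->logged;   /* apply the pending update */
    t->active = false;
}
static void redo_abort(tx_log_t *t)
{
    t->active = false;          /* discard the log */
}

/* Undo log: the store saves the old value and writes in place. */
static void undo_store(tx_log_t *t, int *addr, int v)
{
    if (!t->active) {
        t->addr = addr;
        t->logged = *addr;      /* remember the original value */
        t->active = true;
    }
    *addr = v;
}
static void undo_commit(tx_log_t *t)
{
    t->active = false;          /* new values are already in place */
}
static void undo_abort(tx_log_t *t)
{
    if (t->active)
        *t->addr = t->logged;   /* restore the original value */
    t->active = false;
}
```

In redo mode the work happens at commit and abort is free; in undo mode it is the other way around.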
4.3 Sample of Bugs Found
In this section we describe interleavings that triggered wrong behaviors in
the STM.
4.3.1 Bug 1: Reference to Non-Transactional Memory
Figure 4.4 shows a problem where transaction T1 is accessing a piece of transactional
memory already released by another transaction T2. The transactions are operating on
a list with three nodes: x, y, and z. T1 is iterating over the list and reading the nodes'
keys. T2 is deleting node y from the list.
Step  T1               T2               T3                Description
1     y = TxLoad(x.n)                                     Get the pointer to node y
2                      y = TxLoad(x.n)
3                      z = TxLoad(y.n)
4                      TxStore(x.n, z)
5                      TxCommit()
6                      TxSterilize(y)
7                      free(y)
8                                       new = malloc()    malloc returns the block
                                                          released in the previous step
9                                       *new = something
10    TxLoad(y.k)                                         Read y key ... inconsistent read
Figure 4.4: Undo/Redo log mode: Reference to Non-Transactional Memory.
The bug arose because T1 did not detect that its read set was inconsistent when
performing step 10, as the memory freed in step 7 had already been reused by
another thread (T3).
The original TxSterilize function from TL2 only waited for all writes to variable y
to drain, allowing this harmful interleaving. To correct the problem, TxSterilize was
changed to also raise the lock version of variable y to the current value of the
global version clock. In this way, transactions that access y after the sterilization will
detect that the lock version of y has been updated and will, therefore, abort.
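The corrected sterilization can be sketched in C as follows. This is only an illustration under assumptions, not the TL2 or prototype code: the versioned-lock layout (bit 0 as the lock flag, the remaining bits as the version), the names vlock_t, global_clock, sketch_sterilize and sketch_read_is_valid, and the use of C11 atomics are all hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical versioned lock: bit 0 = locked flag, upper bits = version. */
typedef _Atomic uint64_t vlock_t;

/* Stand-in for the global version clock (assumption of this sketch). */
_Atomic uint64_t global_clock;

int vlock_is_locked(uint64_t word) { return (int)(word & 1u); }
uint64_t vlock_version(uint64_t word) { return word >> 1; }

/* Wait until no writer holds the lock, then publish the current global
 * clock value as the lock version, leaving the lock free. */
void sketch_sterilize(vlock_t *lock)
{
    for (;;) {
        uint64_t cur = atomic_load(lock);
        if (vlock_is_locked(cur))
            continue;                      /* a writer is still draining */
        uint64_t next = atomic_load(&global_clock) << 1;
        if (atomic_compare_exchange_weak(lock, &cur, next))
            return;                        /* version raised atomically */
    }
}

/* A read is valid only if the lock is free and its version is not newer
 * than the timestamp the reading transaction started with. */
int sketch_read_is_valid(vlock_t *lock, uint64_t tx_timestamp)
{
    uint64_t word = atomic_load(lock);
    return !vlock_is_locked(word) && vlock_version(word) <= tx_timestamp;
}
```

With the global clock at 5 and a reader whose timestamp is 4, a lock at version 3 validates before sterilization and fails afterwards, which is what forces T1 to abort in step 10 of Figure 4.4.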
4.3.2 Bug 2: Lost Update with Lock Collision
As depicted in Figure 4.5, commit starts by iterating over the write set. If a variable is
also in the read set then it is a read/write variable; otherwise it is a write-only variable.
For read/write variables, the lock version is checked against the transaction timestamp
and, if it is greater, the transaction aborts. For write-only variables, the original
algorithm performs no such check.
 1  for each i in write-set {
 2      if (i is not locked) {
 3          if (i is also in read-set) {
 4              // read/write variable
 5              if (get_lock_version(i) > tx_timestamp)
 6                  abort;
 7              else
 8                  lock(i)
 9          } else {
10              // write-only variable
11              lock(i)
12          }
13      }
14  }
Figure 4.5: Lock acquisition in redo log mode—buggy version
Figure 4.6 shows a possible interleaving of two transactions, T1 and T2. T1 is
updating the write-only variable y and the read/write variable x. T2 is only updating
variable x.
When operating in redo log mode, the original TL2 algorithm failed with the
interleaving shown in Figure 4.6 because of a lock collision. If variables x and y have
identical hashes, they collide and share the same lock in the lock table. With this
interleaving, when T1 commits, it starts acquiring the locks according to the algorithm
in Figure 4.5. The first iteration of the algorithm finds variable y and, since it is a
write-only variable, the algorithm goes to step 11 and locks the variable. The second
iteration finds variable x and, due to the lock collision with y, finds x to be already
locked and erroneously skips any verification of the lock version.
Step  T1               T2               Description
1     TxLoad(x)
...........................................................................
2                      TxStore(x, a)
3                      TxCommit()
...........................................................................
4     TxStore(y, a)                     y is write only
5     TxStore(x, a)                     x is read-write
6     TxCommit()
7       →Acq(y)                         lock acquisition phase
8       →Acq(x)
9       →RL(x)                          read set validation phase
Figure 4.6: Redo log mode: Lost Update with Small Lock Table
The correction to this problem is to always validate the lock version of read/write
variables, even if the lock is already held. Figure 4.7 shows the corrected algorithm.
for each i in write-set {
    if (i is also in read-set) {
        // read/write variable
        if (i is not locked &&
            get_lock_version(i) <= tx_timestamp) {
            lock(i);
        }
        else if (i is locked by this thread &&
                 get_lock_version(i) <= tx_timestamp)
            continue;
        else
            abort();
    } else {
        // write-only variable
        lock(i);
    }
}
Figure 4.7: Lock acquisition in redo log mode—correct version
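To make the difference between the two figures concrete, the following single-threaded C sketch runs both acquisition loops over a deliberately tiny lock table in which two addresses collide. Everything here is a hypothetical model, not the prototype's code: the lock_entry and write_entry types, the lock_for hash, the convention that owner 0 means "free", and the choice to abort (rather than wait) when another thread holds a lock are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 4            /* tiny on purpose, so addresses collide */

typedef struct {
    uint64_t version;                /* version left by the last committer */
    int      owner;                  /* 0 = free, otherwise holder's id */
} lock_entry;

typedef struct {
    uintptr_t addr;                  /* address of the transactional variable */
    bool      in_read_set;           /* true for read/write variables */
} write_entry;

lock_entry lock_table[LOCK_TABLE_SIZE];

/* Hypothetical hash: drop the word-alignment bits, index modulo table size. */
lock_entry *lock_for(uintptr_t addr)
{
    return &lock_table[(addr >> 2) % LOCK_TABLE_SIZE];
}

/* Mirrors Figure 4.5: an entry that is already locked is skipped unchecked. */
bool acquire_buggy(const write_entry *ws, size_t n, int self, uint64_t ts)
{
    for (size_t k = 0; k < n; k++) {
        lock_entry *l = lock_for(ws[k].addr);
        if (l->owner != 0)
            continue;                          /* no version check at all */
        if (ws[k].in_read_set && l->version > ts)
            return false;                      /* abort */
        l->owner = self;
    }
    return true;
}

/* Mirrors Figure 4.7: read/write variables are validated even when the
 * lock is already held by this thread. */
bool acquire_fixed(const write_entry *ws, size_t n, int self, uint64_t ts)
{
    for (size_t k = 0; k < n; k++) {
        lock_entry *l = lock_for(ws[k].addr);
        if (ws[k].in_read_set) {
            if (l->owner == 0 && l->version <= ts)
                l->owner = self;
            else if (l->owner == self && l->version <= ts)
                continue;
            else
                return false;                  /* abort */
        } else {
            if (l->owner != 0 && l->owner != self)
                return false;                  /* assumption: abort, do not wait */
            l->owner = self;
        }
    }
    return true;
}
```

On failure the caller is expected to release whatever locks were taken. With x and y mapped to the same entry and the entry's version newer than T1's timestamp, the first function lets T1 through (the lost update of Figure 4.6) while the second one aborts it.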
4.3.3 Bug 3: Dirty Read Not Invalidated when Transaction Aborts
Figure 4.8 shows an example of a hidden dirty read in undo log mode. Transaction T1
reads the variable x and commits, and transaction T2 writes to the same variable and
aborts. In step 2, T1 reads the lock version of variable x. Then T2 stores a new value
in x. In step 6, T1 reads the value of x after it was changed by T2 (a dirty read). In
step 7, T2 aborts and the old value of x is restored.
Step  T1             T2                Description
1     TxLoad(x)                        T1 loading variable x.
2       →RL(x)
...........................................................................
3                    TxStore(x, a)
4                      →Acq(x)
5                      →WV(x, a)       new value is written
...........................................................................
6       →RV(x)                         dirty value is read by T1
...........................................................................
7                    TxAbort()         T2 aborts
8                      →WV(x, old)     old x value is restored
9                      →Rel(x)
...........................................................................
10      →RL(x)                         lock version revalidation
11    TxCommit()
Figure 4.8: Undo log mode: Dirty read not invalidated when transaction aborts
The bug in this situation was that, when transaction T2 aborted (in undo log mode)
the value of variable x was restored and the lock was simply being released with the
old version number. Transaction T1 was not detecting that it read a dirty value because
the lock version revalidation, in step 10, returned the same value as in step 2, therefore
assuming the value read in step 6 was valid. To correct this problem, when aborting a
transaction in undo log mode, the lock version of every variable in the write set must
be incremented.
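The corrected abort path can be sketched in C as below. This is a simplified, single-threaded model under assumptions: the vlock and undo_entry types and the function names are hypothetical, and the new version is obtained by adding one to the old version, whereas a real engine might derive it from the global version clock.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t version;    /* lock version seen by readers */
    int      locked;     /* 1 while a writer holds the lock */
} vlock;

typedef struct {
    long  *addr;         /* location that was overwritten in place */
    long   old_value;    /* value saved before the overwrite */
    vlock *lock;         /* lock covering addr */
} undo_entry;

/* Abort in undo log mode: restore the saved values, then release every
 * lock with a NEW version so that concurrent readers notice the change. */
void undo_abort(undo_entry *log, size_t n)
{
    /* Restore newest-first so the oldest saved value ends up in memory. */
    for (size_t k = n; k-- > 0; )
        *log[k].addr = log[k].old_value;

    for (size_t k = 0; k < n; k++) {
        vlock *l = log[k].lock;
        if (l->locked) {         /* entries sharing a lock bump it once */
            l->version++;        /* assumption: +1 stands in for "incremented" */
            l->locked = 0;
        }
    }
}

/* Reader-side revalidation (step 10 of Figure 4.8). */
int read_still_valid(const vlock *l, uint64_t version_seen)
{
    return !l->locked && l->version == version_seen;
}
```

Releasing with the old version would make the final revalidation succeed even though the value read in between was dirty; bumping the version makes it fail.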
4.3.4 Bug 4: Lost Update on Lock Upgrade
Figure 4.9 shows a problem that happened in undo log mode, when reading a variable
and then modifying its value. Transaction T1 is incrementing the variable x and
transaction T2 is storing a new value in the same variable. The problem was that
TxStore did not check whether the lock version was the same as the one obtained in the
first TxLoad; it merely acquired the lock and wrote the value. The correction was to
force TxStore to validate the lock version before writing to the variable.
Step  T1                   T2               Description
1     v = TxLoad(x)
...........................................................................
2                          TxStore(x, w)
3                          TxCommit()
...........................................................................
4     TxStore(x, v + 1)                     upgrade from read access to write access
Figure 4.9: Undo log mode: Lost update on lock upgrade
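A minimal C sketch of the corrected store is given below. It is a single-threaded model, not the prototype's TxStore: the vlock type, the function name store_after_load, the convention that owner 0 means "free", and the omission of the undo log bookkeeping are assumptions made to keep the example short.

```c
#include <stdint.h>

typedef struct {
    uint64_t version;    /* version left by the last committer */
    int      owner;      /* 0 = free, otherwise id of the holder */
} vlock;

/* Store to a variable that this transaction has previously loaded.
 * Returns 1 on success and 0 if the transaction must abort.
 * version_at_load is the lock version observed by the earlier load. */
int store_after_load(vlock *l, long *addr, long value,
                     int self, uint64_t version_at_load)
{
    if (l->owner != 0 && l->owner != self)
        return 0;                    /* held by another transaction */
    if (l->version != version_at_load)
        return 0;                    /* someone committed since the load */
    l->owner = self;                 /* upgrade from read to write access */
    *addr = value;                   /* undo log entry omitted in this sketch */
    return 1;
}
```

In the interleaving of Figure 4.9, T2's commit changes the lock version between steps 1 and 4, so the version check makes T1's store fail instead of silently overwriting T2's value.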
4.4 Testing Patterns
Testing the STM engine aims at increasing the probability of generating harmful
interleavings. Harmful interleavings are those that improperly read and/or modify the
shared data, i.e., locks and transactional variables. Such interleavings can occur
during reads, writes/updates, commits, aborts, and when adding and removing variables
from the transactional space.
Tests may target specific implementation options, such as the bug described in Sec-
tion 4.3.2, or concurrency control errors, which depend on both implementation op-
tions and execution environment.
4.4.1 Very Short Transactions
This testing pattern aims at maximizing the interleavings between the main
transactional operations, i.e., reads, writes, commits and aborts. Typical transactions
execute many more load and store operations than commits and aborts. To stress test
the commit algorithm it is useful to increase the frequency of commits, and we
can do that by using very short transactions. To improve the efficiency of the test, it
should use read-only, write-only and read-write variables, as some engines treat
these variables differently.
An example of this pattern is a list with a header and a single node. The concurrent
transactions are continuously updating the node and/or reading its contents. The
transactions are very short as there is only one node, so the ratio of time spent
in the commit of the transaction to the time spent in its body is higher than it
would be if the transaction were longer. We found this example useful to
find the bug reported in Section 4.3.2.
This pattern works best for redo log based STM engines. Such systems only change
the shared state on commit, which shortens the time window in which a transaction runs
with the shared state changed, and concurrency errors can only be revealed once there
are changes in the shared state.
4.4.2 High Frequency of Variables Being Added and Deleted
This testing pattern stresses the variability of the transactional space by
repeatedly inserting data into and removing data from it.
This pattern helps detect bugs related to transactions holding pointers
to variables that are simultaneously being released by other transactions, such as the
bug described in Section 4.3.1. These bugs may cause invalid memory accesses and memory
being read or written after the data has been deleted.
An example of this pattern is an implementation of a transactional list, where sev-
eral concurrent transactions are continuously adding and removing nodes.
4.4.3 High Number of Updates on a Small Number of Variables
This testing pattern aims at generating a very high frequency of collisions between
transactional reads and writes, also forcing transactions to abort very frequently. This
pattern can be instantiated with a list that can hold a very small number of nodes (e.g.,
ten) and with several (e.g., five) concurrent transactions trying to get and update the
list nodes.
This test produced very good results with the undo log strategy, because it changes
the shared data on writes (locks and data), on commits (locks) and on aborts (locks and
data). Overall, this testing pattern was found to be very effective.
This pattern was particularly useful to find the bugs reported in Sections 4.3.2
and 4.3.3, as these bugs are related to read/write collisions.
4.4.4 Small Lock Table
STM engines that use a lock table to store the locks of objects or data usually use
a hashing function to map the object address to its lock within the lock table. Such
a hashing function may map several objects to the same table position, causing a
lock collision. Lock collisions may cause an improper validation of the
lock state, with transactions never being able to commit and potentially running into
livelocks.
This pattern aims at maximizing the function $f_L = (V \times T) / L$, where V is the
average number of transactional variables manipulated by each transaction, T is the
number of running transactions and L is the size of (number of entries in) the lock
table. The pattern can, therefore, be instantiated by an adequate combination of i) using
a smaller lock table; ii) increasing the number of transactional variables; and iii)
increasing the number of running transactions. This pattern contributed significantly to
finding the bug reported in Section 4.3.4. Just by decreasing the size of the lock table
to a very small number it was possible to obtain a significant number of lock collisions
and reproduce the harmful interleaving.
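The effect of shrinking the lock table can be illustrated with a short C sketch. The hash used here (discard the two alignment bits, then take the remainder modulo the table size) and the helper names lock_index, lock_pressure and count_collisions are assumptions for illustration; the actual mapping used by TL2 or the prototype is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address-to-lock mapping: drop the word-alignment bits and
 * index the lock table modulo its size. */
size_t lock_index(uintptr_t addr, size_t table_size)
{
    return (size_t)((addr >> 2) % table_size);
}

/* f_L = (V * T) / L: average variables per transaction, times running
 * transactions, divided by the number of lock table entries. */
double lock_pressure(double v, double t, double l)
{
    return v * t / l;
}

/* Number of pairs of distinct addresses that share a lock table entry. */
size_t count_collisions(const uintptr_t *addrs, size_t n, size_t table_size)
{
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (lock_index(addrs[i], table_size) ==
                lock_index(addrs[j], table_size))
                pairs++;
    return pairs;
}
```

For five consecutive 4-byte words, a table with 1024 entries produces no colliding pairs, a table with four entries produces one, and a table with two entries produces four, which is why a very small table makes collision-dependent interleavings much easier to hit.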
4.4.5 More Concurrent Transactions than CPUs
If the number of transactions is less than the number of CPUs, any transaction ready
to run can be immediately assigned to a CPU, and transactions will never be stalled
waiting for a CPU. In that case, some interleavings are harder to reproduce, because
they depend on transactions being preempted and stalled for some time. Using more
threads than CPUs causes some of the threads to be preempted for long periods of
time, potentially while holding locks.
This pattern was useful to reproduce the bug reported in Section 4.3.3.
4.5 Conclusions
These testing patterns are a simple way to find bugs in STM engines. In the case of
our prototype, they were helpful as we could find several problems just by using these
techniques.
Much work remains to be done in this area, especially in finding more testing
patterns.
Chapter 5
Prototype Validation
This Chapter describes the tests and experiments made with the prototype and shows
the obtained results.
5.1 Introduction
During and after developing the enhancements to TL2, several tests were created and
executed to verify the engine's functionality, stability and performance. The tests were
made with red black tree and sorted list test harnesses, which are two of the most
common test sets in the literature.
Functional tests were made with a simple functional test harness consisting of
several test cases, each aimed at one functionality of the prototype.
Stability tests were made with the performance test harness by running several
load patterns, several thread counts, several operating systems (Linux and Solaris
x86), and several machines with different hardware configurations. Each version of
the prototype was left running to find errors. The latest version ran for more than
seven days in a row without errors.
The performance tests were aimed at validating the performance in several
scenarios. First we compared the performance of the prototype implementation options.
Next we tested against the original TL2 implementation. Finally we tested against
another STM engine, the Robert Ennals STM engine [Enn06]. The Ennals STM implementation
was chosen because it shows great performance results and it was also used as a
performance baseline by the TL2 authors.
5.2 Description of the Tests
We decided to use a test harness similar to the one used by the TL2 authors. The tests
were made with a Red Black Tree implementation based on the one found in the TL2
package, which in turn is based on the java.util.TreeMap implementation. However,
several modifications were made to adapt it to the different API of object mode
as well as to the API of the Ennals STM. We have also created a variant of the tests
with an implementation of a sorted list.
The tests consist of a series of operations on a set. The set is either a Red Black Tree
implementation or a Sorted List implementation. Both implementations have three
methods: put, delete and get. The set elements have a key and a value, and all methods
are indexed by the key. Duplicate keys are not allowed, and adding an element with an
already existing key just updates its value.
The sorted list implementation is a standard doubly linked list created from scratch.
The put (insert) operation places the nodes sorted by key; the get operation runs
through the list until it finds the element; the delete operation first gets the
element and then updates the pointers of the adjacent nodes.
5.2.1 Test Harness Implementation
The test harness is divided into two components: the set implementation (which may be
the sorted list or the red black tree implementation) and the harness launcher. The
harness launcher starts a number of threads in parallel. Each thread continuously loops
between: i) choosing an operation to execute (put, delete or get) based on a random
number; ii) calculating a random key; and iii) executing the selected operation on that
key. The probability of put/delete/get operations is chosen via a command line
argument. The key range is also passed as an argument and limits the number of elements
the set can have, e.g., a key range of 1000 means that the set can have elements with
keys from 0 to 999, so the maximum number of elements in the set is 1000.
The key range largely affects the contention on the set. With a low key range, the
probability of having more than one thread reading, writing or deleting the same node
increases. Therefore the key range has a big impact on overall contention. A second
effect of a larger key range is that it increases the search time for an element. This
is especially relevant for the list implementation, where the average search length
(and time) is proportional to the key range.
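The per-thread selection logic described above can be sketched in C as follows. This is not the harness source: the xorshift generator, the function names next_random, choose_op and choose_key, and the encoding of the operation mix as two percentages are assumptions chosen to keep the example self-contained.

```c
#include <stdint.h>

typedef enum { OP_PUT, OP_DELETE, OP_GET } op_kind;

/* Small xorshift generator so that each thread can keep its own private
 * random state (the state must be non-zero). */
uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Map a random number to an operation according to the put and delete
 * percentages; the remaining percentage goes to get. */
op_kind choose_op(uint32_t r, unsigned put_pct, unsigned del_pct)
{
    unsigned roll = r % 100;
    if (roll < put_pct)
        return OP_PUT;
    if (roll < put_pct + del_pct)
        return OP_DELETE;
    return OP_GET;
}

/* Keys are drawn from 0 .. key_range - 1, so key_range also bounds the
 * number of elements the set can hold. */
uint32_t choose_key(uint32_t r, uint32_t key_range)
{
    return r % key_range;
}
```

A worker thread would repeatedly draw two random numbers, pass the first to choose_op and the second to choose_key, and then call the corresponding set method on that key.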
The tests have been made using the several possible combinations of the prototype:
word based mode with redo log; word based mode with undo log; object based mode
with undo log. The word based modes always run in consistent states, whereas the object
based modes have three variants regarding consistent state validation: full state
validation, partial state validation and no state validation (see Section 2.3.5). The
object based mode also has two options regarding lock placement: in a separate table or
adjacent to the object. All tested combinations are summarized in Table 5.1.
The tests were made on a two-way dual-core Intel Xeon machine running at 2.66 GHz, for
a total of four processing cores. Although we cannot evaluate the scalability
of the selected alternatives with this machine, we can get a glimpse of their behavior
in a small scale scenario.
For these tests the STM engine, as well as the harness, were compiled for 32-bit x86
on a Linux operating system with kernel version 2.6.18. The compiler used was GCC
version 4.1.2 with the optimization flag -O3, and the debugging and profiling flags
were disabled. Our tracing engine was also disabled.
Short name            Prototype combination
word/tab/redo         Word based mode with redo log, full state validation and lock in a separate table.
word/tab/undo         Word based mode with undo log, full state validation and lock in a separate table.
object/tab/undo/fv    Object based mode with undo log, full state validation and lock in a separate table.
object/tab/undo/pv    Object based mode with undo log, partial state validation and lock in a separate table.
object/tab/undo/nv    Object based mode with undo log, no state validation and lock in a separate table.
object/adj/undo/fv    Object based mode with undo log, full state validation and lock adjacent to the object.
object/adj/undo/pv    Object based mode with undo log, partial state validation and lock adjacent to the object.
object/adj/undo/nv    Object based mode with undo log, no state validation and lock adjacent to the object.
Table 5.1: Tested prototype combinations.
5.2.2 Test Parameters
The test parameters chosen to run the harness are basically the same as those used in
the TL2 tests. They include a small set (200 keys) and a large set (20,000 keys). The
small set represents a high contention structure and the large set represents a low
contention structure. Each set is subject to two load patterns, one with mostly reads
and the other with a higher write percentage. The first pattern has 5% puts, 5% deletes
and 90% gets, which we call the read pattern. The second pattern has 20% puts, 20%
deletes and 60% gets, which we call the write pattern. These patterns are intended to
show the difference in behavior with a varying proportion of reads versus writes.
The tests were run with 1, 2, 4 and 8 threads. Up to 4 threads, the intent is to
study how the performance of the test harness increases with the number of available
CPUs. With 8 threads, the intent is to investigate whether performance decreases
when there are more running threads than CPUs.
5.3 Test Harness Overhead
Before evaluating the test results, we start by evaluating the quality of the test
harness in terms of overhead and impact on the test results.
In the test harness there are three layers running: the harness launcher, the set
implementation and the STM engine. Since we intend to verify the performance of the set
implementation using the STM engine, one requirement on the test harness launcher is
that it be as light as possible, so that it does not hide the true subject of the test.
Therefore it is desirable that the CPU time spent in the harness launcher is
significantly less than the running time of the set implementation plus the STM engine.
Figure 5.1 shows the percentage of time spent in the set implementation plus the STM
engine; the remaining time is spent in the harness. As can be seen, the time spent in
the harness stays below 25-30%, which leaves a solid margin for testing the set
implementation with the STM engine.
[Plot omitted: four panels showing CPU time on STM engine (%) against number of threads (1 to 8) on the Red Black Tree, for the series word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv and obj/adj/fv. Panels: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20,000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20,000 keys / 20% put, 20% del, 60% get.]
Figure 5.1: Percentage of time spent in the harness
Another useful requirement on the test harness is for the time per iteration spent
in the harness launcher to be constant. This eases the speedup analysis and shows the
real performance difference of the test subject between the various test runs. If, for
instance, there was a consistent cache collision effect (e.g., false sharing) in the
test harness, the time per iteration spent in the harness would increase with the number
of CPUs and it would hide the real performance increase/decrease of the overall test.
Therefore, a good quality harness should have a constant time per iteration. When this
test harness was built, it was ensured that there were no read-write variables shared
among threads (to avoid cache coherency traffic on the shared bus) and that the local
variables were properly padded to avoid false sharing.
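The padding technique mentioned above can be sketched in C11 as follows. The 64-byte cache line size, the thread_stats layout, the MAX_THREADS limit and the record_op helper are assumptions for illustration and are not taken from the harness source.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE  64   /* assumed cache line size in bytes */
#define MAX_THREADS 8    /* assumed upper bound on harness threads */

/* Per-thread harness state, padded so that each thread's counters occupy
 * a full cache line of their own. */
typedef struct {
    uint64_t ops_done;    /* operations executed by this thread */
    uint32_t rng_state;   /* private random number generator state */
    char     pad[CACHE_LINE - sizeof(uint64_t) - sizeof(uint32_t)];
} thread_stats;

_Static_assert(sizeof(thread_stats) == CACHE_LINE,
               "thread_stats must fill exactly one cache line");

/* Aligning the array to the line size keeps element boundaries on cache
 * line boundaries, so no two threads write to the same line. */
_Alignas(CACHE_LINE) thread_stats per_thread[MAX_THREADS];

/* Each thread only ever touches its own slot. */
void record_op(int tid)
{
    per_thread[tid].ops_done++;
}
```

Without the padding, counters of neighbouring threads would share a cache line and every increment would invalidate that line in the other CPU's cache, which is the false sharing effect the harness is designed to avoid.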
Figure 5.2 shows the time spent in the harness launcher per operation. As can
be seen, the time is stable at 450 nanoseconds with up to four threads. With more
than four threads the results are not meaningful because the time measurement is
made between the start and the end of the harness operation. If some other thread
preempts the CPU while the harness is running, the whole scheduler quantum of the
other thread is counted as harness time. When there are more threads running than
available CPUs this situation becomes a lot more frequent, especially when transactions
are longer, which is the case in Figures 5.2c and 5.2d.
[Plot omitted: four panels showing harness time/operation (nanosecs) against number of threads (1 to 8) on the Red Black Tree, for the series word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv and obj/adj/fv. Panels: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20,000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20,000 keys / 20% put, 20% del, 60% get.]
Figure 5.2: Test Harness Launcher Overhead
5.4 Test Execution
We now describe the executed tests and evaluate the results. Our benchmarks allow
us to evaluate the performance of the combinations of our prototype, compare the
performance against another STM engine and evaluate the gains achieved by the changes
we made. Regarding the prototype combinations, we compare the undo against the
redo log strategies; we evaluate the cost of doing consistent state validation in all
three modes (full validation, partial validation and no validation); we evaluate the
difference between having a lock table and having the lock adjacent to the object; and
we evaluate the performance when working with the different block sizes.
We did not execute benchmarks of our prototype against an implementation using
locks. Several benchmarks comparing the performance of STM engines against several
types of locks (coarse grained, fine grained, spin locks, Mellor-Crummey and Scott
locks [MCS91], etc.) can be found in [DS06, DON06, DS07, SATH+06, HF03].
5.4.1 Comparing Undo/Redo, Word/Object modes
The first set of tests (Figure 5.3) compares the performance of undo/redo log strategies
and word/object based modes.
[Plot omitted: four panels showing throughput (1000 operations/sec) against number of threads (1 to 8) on the Red Black Tree, for the series word/tab/redo, word/tab/undo and obj/tab/pv. Panels: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20,000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20,000 keys / 20% put, 20% del, 60% get.]
Figure 5.3: Evaluation of implementation alternatives - Red Black Tree
These tests were performed using the three combinations of our prototype. The
object based STM tests were performed with partial validation.
All tests show that the word based/undo log and word based/redo log strategies have
similar performance under all scenarios, although the undo log scheme generally has
a slight advantage.
Tests also show that the object mode with partial validation outperforms the others.
This is due to the lower overhead of object mode, where there is only one transactional
operation per node instead of one per field, and to the reduced number of state
validations.
When running with more threads (8) than CPUs (4), the performance is not affected.
With the list implementation the results (Figure 5.4) also show an advantage of the
object mode with partial state validation.
5.4.2 The cost of consistent state validation
The second set of tests evaluates the cost of consistent state validation in undo
log/object mode. Figure 5.5 compares the performance of the three alternatives: full
validation, partial validation and no validation. The graphs show that full validation
performs 5% to 20% worse than the other combinations.
The tests also indicate that the relative cost of full validation decreases with the
number of threads. Although the performance difference between full validation and
no validation is higher with four threads in absolute terms, the relative difference is
smaller. The reason is that with consistent state validation, transactions detect
inconsistent states sooner, and thus do less useless work. This is confirmed by the
abort time graphs in Figure 5.6.
Aborts may occur during a transactional load or store, during a verification, during
a commit or while handling a fault (e.g., a segmentation fault). It is preferable that
a transaction aborts early, soon after it starts running, rather than later, when
it has already done a lot of work. In terms of performance, the worst
possible time for a transaction to abort is at commit time, because it has consumed
all the resources it needed but finished in a state where it cannot commit.
The graphs in Figure 5.6 show the percentage of aborts at commit time. In the no
validation combination, the percentage of commit time aborts is always higher than
80%, whereas with partial validation and full validation the percentage of commit time
aborts is always lower than 40%.
Having many more commit time aborts explains why the no validation combination
is overtaken by the partial validation combination on the small tree from 2 threads
onwards. When there is only one thread, the no validation combination has the best
performance but, despite having less overhead, it does more useless work when there
is more than one thread, leading to poorer performance.
[Plot omitted: two panels showing throughput (1000 operations/sec) against number of threads (1 to 8) on the Sorted List, for the series word/tab/redo, word/tab/undo and obj/tab/pv. Panels: (a) Sorted lists / 200 keys / 5% put, 5% del, 90% get; (b) Sorted lists / 200 keys / 20% put, 20% del, 60% get.]
Figure 5.4: Evaluation of implementation alternatives - Sorted List
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for obj/tab/nv, obj/tab/pv and obj/tab/fv: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.5: Consistent state validation alternatives - Red Black Tree
[Plots of the percentage of commit time aborts (0 to 100) versus number of threads (1 to 8) for obj/tab/nv, obj/tab/pv and obj/tab/fv: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.6: Abort Time
The results with sorted lists are clearly different. With a sorted list, the no validation
scheme is always faster, as shown in Figure 5.7. Here the validation overhead drives
the result of the test, as the number of operations per transaction is much larger than
on the red black tree. On the sorted list the number of operations is O(k), where k is the
key range, whereas on the red black tree the number of operations is O(log(n)), where n is
the number of nodes in the tree.
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for obj/tab/nv, obj/tab/pv and obj/tab/fv: (a) Sorted list / 200 keys / 5% put, 5% del, 90% get; (b) Sorted list / 200 keys / 20% put, 20% del, 60% get.]
Figure 5.7: Cost of validation - Sorted List
5.4.3 Lock adjacent to the data
This test was made in the object based/undo log mode and it evaluates the performance
gain of having the lock adjacent to the object versus having a lock table.
The benchmarks with the lock adjacent to the data (Figure 5.8) show a small performance
improvement of less than 10% over the lock table. This result was confirmed
by the tests made on the sorted list implementation, shown in Figure 5.9.
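The two lock placements compared in this test can be pictured as follows. This is only an illustrative sketch: the table size, the hash function and all the names are assumptions and do not reproduce the prototype's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_SIZE 1024u /* hypothetical; must be a power of two */

typedef uintptr_t vlock_t;    /* one versioned lock word */

static vlock_t lock_table[LOCK_TABLE_SIZE];

/* Lock table: the object's address is hashed into a shared array of locks,
 * so unrelated objects may collide on the same lock. */
vlock_t *lock_in_table(const void *obj)
{
    uintptr_t a = (uintptr_t)obj;
    return &lock_table[(a >> 3) & (LOCK_TABLE_SIZE - 1u)];
}

/* Adjacent lock: the lock lives in a header placed right before the payload,
 * so it is found with simple pointer arithmetic and no hashing. */
typedef struct {
    vlock_t lock;
} obj_header;

vlock_t *lock_adjacent(void *payload)
{
    assert(payload != NULL);
    return &((obj_header *)payload - 1)->lock;
}
```

With the adjacent lock there are no false collisions between objects and the lock is likely to share a cache line with the data it protects, which is consistent with the small gain measured above.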
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for obj/adj/nv, obj/tab/nv, obj/adj/pv, obj/tab/pv, obj/adj/fv and obj/tab/fv: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.8: Adjacent Lock vs Lock Table - Red Black Tree
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for obj/adj/nv, obj/tab/nv, obj/adj/pv, obj/tab/pv, obj/adj/fv and obj/tab/fv: (a) Sorted list / 200 keys / 5% put, 5% del, 90% get; (b) Sorted list / 200 keys / 20% put, 20% del, 60% get.]
Figure 5.9: Adjacent Lock vs Lock Table - Sorted List
5.4.4 Different Block Sizes
Some STM engines [SATH+06] map a whole block of memory, rather than a single word,
to each lock in order to reduce the number of locks in the system and the overhead of
locking several variables. We have implemented this strategy in our prototype, but the
results are disappointing. As shown in Figure 5.10, the results are nearly the same for
all block sizes. If, for instance, an object fits inside a block, a first write to that object
will lock the entire block, and a second write will find the block already locked and does
not need to lock it again. However, the STM engine still has to verify whether the lock is
held, so the performance advantage is too small to be noticeable.
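The effect of the block size on the address-to-lock mapping can be sketched as below. The table size and the function name are hypothetical; the only point illustrated is that a bigger block makes neighbouring addresses share one lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_TABLE_BITS 10u /* hypothetical table with 1024 locks */

/* Map an address to a lock table index.
 * block_shift is log2 of the block size in bytes (3 for 8 bytes, 6 for 64 bytes). */
size_t lock_index(uintptr_t addr, unsigned block_shift)
{
    assert(block_shift < sizeof(uintptr_t) * 8u);
    return (size_t)((addr >> block_shift) & ((1u << LOCK_TABLE_BITS) - 1u));
}
```

With 8 byte blocks, addresses 0x1000 and 0x1008 map to different locks; with 64 byte blocks they map to the same lock, so the second write finds the block already locked.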
One change that could take advantage of the bigger block size would be to have a
runtime log filter as described in [HPST06]. The optimization described there reduces
the number of entries in the read and write sets by detecting duplicate entries when, for
instance, a variable is loaded twice. With this filtering optimization and a bigger block
size, the size of the read and write sets could be further reduced, which could in turn
reduce the read-set validation overhead and speed up the transactions. This may
be an interesting line of investigation.
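To give an idea of what such a filter might look like, the sketch below puts a small direct-mapped table in front of the read set. This is an assumption made for illustration and is not necessarily the scheme of [HPST06]; the sizes and names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FILTER_SLOTS 64u   /* hypothetical; must be a power of two */
#define READ_SET_MAX 1024u /* hypothetical capacity */

typedef struct {
    const void *filter[FILTER_SLOTS];  /* last address logged through each slot */
    const void *entries[READ_SET_MAX]; /* the read set itself */
    size_t count;
} read_set;

/* Log addr in the read set unless the filter shows it was just logged.
 * Returns true when a new entry was appended. addr must not be NULL,
 * because NULL marks an empty filter slot. */
bool read_set_log(read_set *rs, const void *addr)
{
    assert(rs != NULL && addr != NULL);
    size_t slot = (size_t)(((uintptr_t)addr >> 3) & (FILTER_SLOTS - 1u));
    if (rs->filter[slot] == addr)
        return false;                 /* duplicate load: nothing to log */
    assert(rs->count < READ_SET_MAX); /* a real engine would grow the log */
    rs->filter[slot] = addr;
    rs->entries[rs->count++] = addr;
    return true;
}
```

The filter is conservative: when two addresses collide on a slot, an address may be logged more than once, but an address that was read is never left out of the read set.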
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for word/tab/undo with block sizes of 8, 16, 32 and 64 bytes: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.10: Comparing different block sizes.
5.4.5 Comparing the performance against the ported TL2
As said before, the original TL2 was developed for Sun Solaris on the SPARC architecture,
using the Sun Pro C compiler. During this thesis we had no chance of testing the
prototype on this platform, which is why we ported TL2 to Linux/X86/GCC.
This thesis was born after the port to our platform; however, we have kept the ported
version without additional changes to serve as a baseline for performance analysis.
In this set of tests we have compared the performance of the ported TL2 (the version
with the minimal set of changes necessary to run on the available configuration)
against the fully modified prototype of this thesis. It would therefore be unfair to say that
the comparison is against the original TL2, because of the changes needed to make it
work on our platform. The changes made to the ported TL2 were described in
Sections 3.3.1 and 3.3.2: X86 assembly instructions; removal of the schedctl mechanism;
and replacement of the non-faulting load instructions with fault handlers.
Figure 5.11 shows the test against the ported TL2. There is a performance improvement
on all the combinations.
The test also shows that the ported TL2 is less resilient to overload, that is, to running
with more threads than available CPUs. On all tests, the performance
of the ported TL2 drops 10-20%, whereas on our prototype the performance drop is
negligible on all of the tested combinations. The reason is related to the reduced cache
coherency traffic on the bus due to the improved TxLoad algorithm, which reads the
shared lock only once.
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv and TL2: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.11: Ported TL2 vs improved prototype
5.4.6 Comparing the performance against Ennals STM
In this test we compared the performance of our prototype against one of the top
performing STM engines to date. To prepare these tests we had to adapt our harness to be
used by Ennals’ implementation; therefore these tests are significantly different from
the ones used by Ennals in [Enn06] and by TL2 in [DON06]. The Ennals STM code
contained some hard coded checks in the engine so that it could use the original red black
tree implementation from [HF03]. We decided to remove those hard coded checks since
they were useless with our test harness.
The test results are shown in Figure 5.12. We can see Ennals STM having a close
but higher performance than our prototype. The closest combination in terms of
performance is object mode/lock adjacent to the data/no validation/undo log mode. In fact this
combination is the closest to the Ennals STM implementation. Still, with this combination,
some algorithmic differences subsist between Ennals STM and our prototype:
Ennals uses a specialized malloc/free implementation, whereas our prototype
uses the standard malloc/free; and Ennals uses a standard version write lock, whereas our
prototype uses the global version clock algorithm (although it is not used for consistent
state validation, it is required to mark deleted/free’d objects). We consider the
performance difference (less than 5%) to be acceptable, considering that our prototype
does not need a special malloc/free implementation.
When running with more threads than CPUs, the performance of Ennals STM drops
significantly and is overtaken by our prototype, which maintains the same performance
level. This effect on Ennals STM may be due to the absence of a backoff
mechanism. This mechanism, which is available in TL2, reduces contention by throttling
down transactions when they try to access the same variable. With the backoff
mechanism, when a transaction tries to access a variable that is locked by another
transaction, it rolls back the changes it made and then waits a certain period of time before
retrying.
[Plots of throughput (1000 operations/sec) versus number of threads (1 to 8) for word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv, obj/adj/fv, TL2 and Ennals: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.12: Ennals vs improved prototype
5.4.7 Cache coherency problems
One important topic when trying to create a well-performing multi-threaded program
is the cache coherency traffic. Most common shared memory hardware architectures
use a bus shared among all CPUs. The bus has several uses, two of which are
reading and writing data to the main memory, and running the cache
coherency protocol among the CPUs. This protocol maintains a consistent view of the main
memory among all memory caches in situations where two or more CPUs are handling the same
memory address.
If, for example, the value of a memory address is stored in the cache of two or more
CPUs and one of them writes to that address, that CPU must inform the others that the
value in their caches is no longer up-to-date (invalid) [HP96]. Therefore, when the
other CPUs need to access the same address a cache miss will occur (even though the
address is in the cache, it is no longer up-to-date), causing extra traffic on the shared
bus and delaying the execution.
In this test we have created a shared read/write variable that is read and written
on every iteration of the harness and that occupies one cache line. The shared
variable was the random number generator seed. The test is otherwise the same as the one in
Section 5.4.6 and the results are shown in Figure 5.13.
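A variable that occupies exactly one cache line can be declared as sketched below. The 64 byte line size, the type name and the generator constants are assumptions made for illustration; the generator itself is not synchronized and only shows that every call reads and writes the same line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* assumed; common on X86 but machine dependent */

/* The seed padded so that it occupies one full cache line. */
typedef union {
    uint32_t seed;
    char pad[CACHE_LINE_SIZE];
} shared_seed;

/* One linear congruential step on the shared seed. When several threads
 * call this on the same object, the line bounces between their caches. */
uint32_t next_random(shared_seed *s)
{
    assert(s != NULL);
    s->seed = s->seed * 1103515245u + 12345u;
    return s->seed;
}
```

Because each call writes the seed, every other CPU that holds the line has its copy invalidated, which is the traffic pattern this test is meant to provoke.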
[Plots of throughput (1000 operations/sec, axis from 0 to 800) versus number of threads (1 to 8) for word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv, obj/adj/fv, TL2 and Ennals: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.13: High cache coherency traffic on the bus
In this test, the performance increase with the number of threads is much lower
(on all test combinations) than it was in the previous tests. With eight threads, the
performance of our prototype is maintained, but the performance of the ported TL2 and
Ennals STM drops sharply. The difference is related to the improved TxLoad algorithm
and to the backoff contention reduction algorithm. Ennals STM does not have a
contention reduction algorithm: when a transaction aborts, it immediately retries. TL2
uses an algorithm which spins for an exponential amount of time on abort, whereas
our prototype uses an algorithm similar to the one in TL2, but it yields the CPU to other
threads instead of doing a busy wait. Further study is needed to investigate
the impact of the contention reduction algorithm.
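The yielding variant of the backoff can be sketched as follows. The cap on the exponent and the function names are hypothetical, and the prototype's actual waiting policy may differ; `sched_yield` is the POSIX call that gives up the CPU.

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

#define BACKOFF_MAX_SHIFT 10u /* hypothetical cap on the exponent */

/* Number of waiting rounds after the given number of consecutive aborts:
 * 2^aborts, capped at 2^BACKOFF_MAX_SHIFT. */
uint32_t backoff_rounds(uint32_t aborts)
{
    uint32_t shift = aborts < BACKOFF_MAX_SHIFT ? aborts : BACKOFF_MAX_SHIFT;
    return 1u << shift;
}

/* Wait before retrying an aborted transaction, yielding the CPU on each
 * round instead of spinning in a busy wait. */
void backoff_yield(uint32_t aborts)
{
    for (uint32_t i = backoff_rounds(aborts); i > 0; i--) {
        int rc = sched_yield();
        assert(rc == 0);
        (void)rc; /* keeps the variable used when assertions are disabled */
    }
}
```

Yielding matters most under overload: a spinning thread keeps its CPU busy while the thread that holds the lock may be waiting to be scheduled.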
5.4.8 STM engine overhead
One way to evaluate the overhead of the STM engine is to compare its performance
against a non synchronized version of the algorithm. In this test, all the transactional
primitives were removed from the test harness, so there is no logging, locking,
or validation. Naturally, the non synchronized version can only run with one thread.
Figure 5.14 shows the performance of all combinations and of the non synchronized
red black tree version, which is identified as VoidSTM. All tests were run
with one thread only and, as expected, the non synchronized version outperforms all
the others, achieving a performance between 1.5 and 3 times higher.
The performance of the non synchronized version is only beaten by the STM
versions when they run with 2 threads. There is still a big overhead, and with a small
number of CPUs the purely sequential version is a respectable adversary.
[Bar charts of throughput (1000 operations/sec, axis from 0 to 1000) for word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv, obj/adj/fv, TL2, Ennals and VoidSTM: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.14: STM engine overhead
5.4.9 Speedup Analysis
Another important measure of the prototype’s performance is the speedup. Figure 5.15
shows that the speedup is almost linear on the read dominated patterns and still very good on
the write dominated patterns. The combinations that do not do consistent state validation have
the lowest speedup, as they waste more time running in inconsistent states. Also, for
Ennals and the ported TL2 the speedup decreases when there are more threads than CPUs.
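This section does not state how the speedup is computed. A common definition, assumed in the sketch below, is the throughput of a combination with n threads divided by the throughput of the same combination with one thread; the function name is hypothetical.

```c
#include <assert.h>

/* Speedup of a run with n threads relative to the single-threaded run of the
 * same combination, both given as throughput in operations per second. */
double speedup(double ops_per_sec_n, double ops_per_sec_1)
{
    assert(ops_per_sec_1 > 0.0);
    return ops_per_sec_n / ops_per_sec_1;
}
```

Under this definition a linear speedup means a value of n with n threads, and a value that falls as threads are added means that the extra threads reduce the throughput.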
[Plots of speedup (axis from 0 to 5) versus number of threads (1 to 8) for word/tab/redo, word/tab/undo, obj/tab/nv, obj/tab/pv, obj/tab/fv, obj/adj/nv, obj/adj/pv, obj/adj/fv, TL2 and Ennals: (a) Red Black Tree / 200 keys / 5% put, 5% del, 90% get; (b) Red Black Tree / 200 keys / 20% put, 20% del, 60% get; (c) Red Black Tree / 20.000 keys / 5% put, 5% del, 90% get; (d) Red Black Tree / 20.000 keys / 20% put, 20% del, 60% get.]
Figure 5.15: STM engine speedup
5.5 Conclusions
The above tests show a small but consistent advantage of the undo-log strategy over
redo-log. They also show that object mode performs better than word based mode,
although the advantage fades when contention is higher.
In terms of performance, consistent state validation has a prize and a penalty: the
prize is the early detection of inconsistent states and therefore the early abort of the
transaction; the penalty is the additional instructions involved in validation.
The tests with partial validation show significantly better performance than full
validation, and in some cases partial validation overtakes the no validation option.
The tests made with the adjacent lock show a small advantage over the tests with
the lock table.
Finally, the speedup analysis shows that all prototype combinations have a close to linear
speedup.
Chapter 6
Conclusions
This chapter summarizes the results of this investigation and gives some pointers
for future work.
6.1 Conclusions
This work has presented several implementation options for STM engines. Those
options were evaluated in terms of performance and safety, always with a focus on
running transactions in consistent states.
We have based our work on the TL2 implementation described in [DON06]. We
started by porting the TL2 implementation from Solaris SPARC 64bit to Linux X86. In the
X86 version we implemented a set of new features, performance enhancements and a
new tracing engine.
The newly added features were:
• We have implemented user called transaction aborts.
• We have included an automatic transaction retry mechanism which restarts the
transaction when an inconsistent state is detected, changing the commit semantics
from “at most one” to “exactly one”.
• We have implemented transaction nesting on TL2 with partial rollbacks in the undo
and redo log schemes and in the object and word based modes.
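The automatic retry mechanism listed above can be pictured as a loop around the transaction body. This is a toy sketch under stated assumptions: the body is assumed to report an inconsistent state through its return value, whereas a real engine would also roll back the transaction's changes before restarting; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A transaction body: returns true when it committed and false when an
 * inconsistent state was detected and the attempt had to be abandoned. */
typedef bool (*tx_body)(void *arg);

/* Run the body until it commits; returns the number of attempts made. */
unsigned run_transaction(tx_body body, void *arg)
{
    assert(body != NULL);
    unsigned attempts = 0;
    do {
        attempts++;
    } while (!body(arg));
    return attempts;
}

/* Example body: fails until the counter pointed to by arg reaches zero. */
bool flaky_body(void *arg)
{
    unsigned *remaining = arg;
    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}
```

With a counter of 2, `run_transaction(flaky_body, &counter)` makes three attempts: two that are abandoned and a third that commits.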
We have made experiments with different design options:
• We have improved the global version clock algorithm to increase its safety fea-
tures. The original algorithm did not guarantee running in consistent states when
transactional memory was released (free’d).
• We have adapted TL2’s global version clock algorithm to be used with the
undo logging recovery strategy.
• We have adapted TL2 to work in object based mode. This is the first STM
implementation, known to us, to do consistent state validation with the undo logging
strategy (in either object or word based mode). We have concluded that the
object based mode with the undo logging strategy may have significantly better
performance than the others. The benchmarks show that in word based mode the undo
and redo logging strategies yield similar results and that the undo log/object
mode has the best performance, at the cost of a more complex validation scheme.
• We have improved the global version clock algorithm in terms of performance,
achieving a significantly lower overhead on the most used transactional method,
the transactional load, and lower cache coherency traffic on the shared bus. The
test results show that it yields better performance than the original algorithm,
both under regular load and under overload (more running transactions than CPUs),
with close to zero performance degradation when there are more threads
running than CPUs.
• We have studied the performance cost of validating the consistent state, showing
that the impact can be high, especially under low contention. Therefore we propose
a partial state validation scheme, which proves to be a good option in terms of
performance and safety.
• We have studied the effect of lock placement, comparing the performance of locks
in a separate table with that of locks stored adjacent to the data. Our test
results show a small performance improvement for the latter over the former.
• We have implemented and made experiments with different block sizes, to use the
whole cache line instead of just one word. The results do not show any significant
performance difference.
Other contributions were:
• We have built a very lightweight tracing engine, which was indispensable to
debug the STM implementation. We avoided using standard locks for the tracer
synchronization because, given the very fine grained lock granularity of the STM
engine, they would hide most concurrency problems. This tracing engine
records events in order of occurrence within an exclusive section of a single CPU
instruction. We have been able to reproduce all observed bugs using this engine.
• All these changes have been tested with two test harnesses, a red black tree and
a sorted list implementation, which have been exposed to severe test conditions
such as: a list or tree with fewer nodes than transactions operating on it, causing a huge
number of aborts; and a high number of lock collisions, caused by using very
small lock tables (even with a single lock).
• In addition to the changes made on the prototype we have proposed a novel
classification scheme for transaction states. We classified them as: updated consistent
when a transaction has seen a fully updated memory snapshot; obsolete consistent
when a transaction has seen a past memory snapshot; and inconsistent when a
transaction has seen a dirty snapshot.
• We have shown in detail several hard to find concurrency bugs in the STM
implementation and we have designed a few testing patterns which helped in finding
and reproducing those bugs.
• Finally, we are also the first to present a type of hazard that may occur on existing
lock based STMs that use the undo log strategy. This hazard may occur because
transactional writes may take place when a transaction is in an invalid state and
therefore the write may happen on a non-transactional variable.
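One possible reading of the tracing engine described in the list above is a shared buffer in which each event reserves its slot with a single atomic fetch-and-add, so that no lock is needed. The sketch below follows that reading using C11 atomics; the buffer size, the entry layout and the names are assumptions and do not reproduce the prototype's tracer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_CAPACITY 4096u /* hypothetical buffer size */
static_assert(TRACE_CAPACITY > 0, "the trace buffer needs at least one slot");

typedef struct {
    uint32_t thread_id;
    uint32_t event;
} trace_entry;

static trace_entry trace_buf[TRACE_CAPACITY];
static atomic_size_t trace_next; /* next free slot, zero-initialized */

/* Record one event. The only synchronization is the atomic fetch-and-add
 * that reserves a slot, so slots are handed out in order of occurrence.
 * Returns the sequence number, or (size_t)-1 when the buffer is full. */
size_t trace_record(uint32_t thread_id, uint32_t event)
{
    size_t seq = atomic_fetch_add(&trace_next, 1);
    if (seq >= TRACE_CAPACITY)
        return (size_t)-1;
    trace_buf[seq].thread_id = thread_id;
    trace_buf[seq].event = event;
    return seq;
}
```

In this sketch a slot is reserved before it is filled, so the buffer should only be read after the traced threads have stopped.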
6.2 Future Work
A lot of work can still be done to improve STM engines in terms of features,
performance and integration with other applications.
An interesting area is the integration of the STM engine with a debugger. A debugger
could attach to a transaction and show the memory snapshot seen by the transaction by
looking at the transaction log. It could hide the transactional object headers from
the user (unless requested) to reduce the debugging complexity. It could also abort a
transaction on user demand.
Another interesting piece of work would be to create a trace visualization tool that analyzes
and displays the interleavings of the transactions. It could also replay the interleavings
recorded on a specific run.
Regarding our prototype some interesting next steps are:
• Implementation and evaluation of a redo log/object based combination.
• Benchmarks with higher capacity machines, and with different (possibly non
synthetic) test harnesses.
• Integration of the STM engine with a compiler, to avoid the overhead of the func-
tion call and enable further optimizations.
• Implementation of the retry and orElse primitives.
• Improving the transaction nesting functionality to have full support for closed
nesting.
Appendix A
Raw test data
This Appendix shows the raw numbers of the executed tests.
A.1 Keywords
The following tables show the keywords used in the test results tables.
Keyword        Meaning
cmd            STM engine options (see below)
duration       Test duration in 1/10 second
nthr           Number of threads
pput           Frequency of puts (%)
pdel           Frequency of deletes (%)
pget           Frequency of gets (%)
krange         Key range
total          Number of operations performed
ld aborts      Number of aborts detected while running TxLoad
vfy aborts     Number of aborts detected while running TxVerifyAddr
st aborts      Number of aborts detected while running TxStore
segf aborts    Number of aborts detected due to dereferencing an invalid pointer
cmt aborts     Number of aborts detected while running TxCommit
total aborts   Total number of aborts
total time     Total CPU time of the test run (≈ test duration x number of threads)
stm time       CPU time spent running the STM engine
harn time      CPU time spent running the test harness
Figure A.1: Test results keywords
Keyword   Meaning
wtr       Prototype running with: word based mode; lock on separate table; and redo log mode
wtu       Prototype running with: word based mode; lock on separate table; and undo log mode
otn       Prototype running with: object based mode; lock on separate table; undo log mode; and no consistent state validation
otp       Prototype running with: object based mode; lock on separate table; undo log mode; and partial consistent state validation
otf       Prototype running with: object based mode; lock on separate table; undo log mode; and full consistent state validation
oan       Prototype running with: object based mode; lock adjacent to object; undo log mode; and no consistent state validation
oap       Prototype running with: object based mode; lock adjacent to object; undo log mode; and partial consistent state validation
oaf       Prototype running with: object based mode; lock adjacent to object; undo log mode; and full consistent state validation
TL2       Ported version of TL2
Ennals    Ennals STM
Void      Test harness running without synchronization primitives
Figure A.2: STM engine running modes
A.2 Raw Data of the Red Black Tree Tests
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
wtr | 600 | 1 | 5 | 5 | 90 | 200 | 27770799 | 0 | 0 | 0 | 0 | 0 | 0 | 60005256 | 35395311 | 12421432
wtr | 600 | 2 | 5 | 5 | 90 | 200 | 51180027 | 159466 | 0 | 0 | 0 | 4440 | 163906 | 120000383 | 74849376 | 22722799
wtr | 600 | 4 | 5 | 5 | 90 | 200 | 101578160 | 869547 | 0 | 0 | 0 | 27564 | 897111 | 240008478 | 150263828 | 45283256
wtr | 600 | 8 | 5 | 5 | 90 | 200 | 101367586 | 856565 | 0 | 0 | 0 | 26976 | 883541 | 479891088 | 390065200 | 45180438
wtr | 600 | 1 | 20 | 20 | 60 | 200 | 24843451 | 0 | 0 | 0 | 0 | 0 | 0 | 60002760 | 38064489 | 11029156
wtr | 600 | 2 | 20 | 20 | 60 | 200 | 42356240 | 481029 | 0 | 0 | 0 | 51238 | 532267 | 120001881 | 82563127 | 18830283
wtr | 600 | 4 | 20 | 20 | 60 | 200 | 81143132 | 2712433 | 0 | 0 | 0 | 305448 | 3017881 | 239998080 | 168297488 | 36032943
wtr | 600 | 8 | 20 | 20 | 60 | 200 | 78843529 | 4324297 | 0 | 0 | 0 | 702366 | 5026663 | 479982597 | 410051479 | 35238426
wtr | 600 | 1 | 5 | 5 | 90 | 20000 | 22999185 | 0 | 0 | 0 | 0 | 0 | 0 | 60005909 | 39732384 | 10200177
wtr | 600 | 2 | 5 | 5 | 90 | 20000 | 43303808 | 2063 | 0 | 0 | 0 | 72 | 2135 | 119995170 | 81807658 | 19202587
wtr | 600 | 4 | 5 | 5 | 90 | 20000 | 86148801 | 23088 | 0 | 0 | 0 | 1171 | 24259 | 240002092 | 163870104 | 38319349
wtr | 600 | 8 | 5 | 5 | 90 | 20000 | 86102931 | 19440 | 0 | 0 | 0 | 480 | 19920 | 479906605 | 375675211 | 52138840
wtr | 600 | 1 | 20 | 20 | 60 | 20000 | 20665610 | 0 | 0 | 0 | 0 | 0 | 0 | 60001891 | 41641442 | 9237979
wtr | 600 | 2 | 20 | 20 | 60 | 20000 | 36663977 | 6609 | 0 | 0 | 0 | 883 | 7492 | 120004604 | 87553786 | 16287041
wtr | 600 | 4 | 20 | 20 | 60 | 20000 | 72179084 | 38883 | 0 | 0 | 0 | 4888 | 43771 | 239997728 | 175952775 | 32223392
wtr | 600 | 8 | 20 | 20 | 60 | 20000 | 72286432 | 70978 | 0 | 0 | 0 | 10635 | 81613 | 479972789 | 406037670 | 36410904
wtu | 600 | 1 | 5 | 5 | 90 | 200 | 28032554 | 0 | 0 | 0 | 0 | 0 | 0 | 60006032 | 35305746 | 12445230
wtu | 600 | 2 | 5 | 5 | 90 | 200 | 51683985 | 176191 | 0 | 612 | 0 | 3489 | 180292 | 120004308 | 74447612 | 22939546
wtu | 600 | 4 | 5 | 5 | 90 | 200 | 101710398 | 1045081 | 0 | 3940 | 0 | 21778 | 1070799 | 239989223 | 150431856 | 45094249
wtu | 600 | 8 | 5 | 5 | 90 | 200 | 101848732 | 1056648 | 0 | 4130 | 0 | 22100 | 1082878 | 479932622 | 389746793 | 45561214
wtu | 600 | 1 | 20 | 20 | 60 | 200 | 25977716 | 0 | 0 | 0 | 0 | 0 | 0 | 60001497 | 37101712 | 11534264
wtu | 600 | 2 | 20 | 20 | 60 | 200 | 43842195 | 580042 | 0 | 7644 | 0 | 40153 | 627839 | 119999903 | 81357436 | 19453233
wtu | 600 | 4 | 20 | 20 | 60 | 200 | 83459571 | 3320332 | 0 | 46283 | 0 | 257024 | 3623639 | 240002790 | 166475042 | 37029913
wtu | 600 | 8 | 20 | 20 | 60 | 200 | 82737501 | 3836973 | 0 | 78217 | 0 | 350612 | 4265802 | 479990327 | 407021399 | 36763666
wtu | 600 | 1 | 5 | 5 | 90 | 20000 | 23494051 | 0 | 0 | 0 | 0 | 0 | 0 | 60005215 | 39319004 | 10415566
wtu | 600 | 2 | 5 | 5 | 90 | 20000 | 44121925 | 2272 | 0 | 6 | 0 | 50 | 2328 | 120001140 | 81136306 | 19568314
wtu | 600 | 4 | 5 | 5 | 90 | 20000 | 87691388 | 13899 | 0 | 35 | 0 | 376 | 14310 | 240008467 | 162485068 | 39191180
wtu | 600 | 8 | 5 | 5 | 90 | 20000 | 87778410 | 20569 | 0 | 43 | 0 | 475 | 21087 | 479885180 | 378254592 | 50266402
wtu | 600 | 1 | 20 | 20 | 60 | 20000 | 21868235 | 0 | 0 | 0 | 0 | 0 | 0 | 60005497 | 40687016 | 9756332
wtu | 600 | 2 | 20 | 20 | 60 | 20000 | 38428500 | 7700 | 0 | 56 | 0 | 683 | 8439 | 119999672 | 85944519 | 17249496
wtu | 600 | 4 | 20 | 20 | 60 | 20000 | 75194827 | 89982 | 0 | 2245 | 0 | 12671 | 104898 | 239995054 | 173777145 | 33340963
wtu | 600 | 8 | 20 | 20 | 60 | 20000 | 75828184 | 82714 | 0 | 3141 | 0 | 11792 | 97647 | 480311782 | 406482535 | 36918218
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
otn | 600 | 1 | 5 | 5 | 90 | 200 | 30594083 | 0 | 0 | 0 | 0 | 0 | 0 | 60001329 | 32888883 | 13611637
otn | 600 | 2 | 5 | 5 | 90 | 200 | 51488589 | 0 | 0 | 10942 | 0 | 254094 | 265036 | 120001897 | 74359931 | 22927010
otn | 600 | 4 | 5 | 5 | 90 | 200 | 102860344 | 0 | 0 | 64416 | 25 | 1575091 | 1639532 | 240000615 | 148574738 | 45720116
otn | 600 | 8 | 5 | 5 | 90 | 200 | 102627200 | 0 | 0 | 62764 | 10 | 1565464 | 1628238 | 479962248 | 388269108 | 45995894
otn | 600 | 1 | 20 | 20 | 60 | 200 | 29448144 | 0 | 0 | 0 | 0 | 0 | 0 | 60005374 | 33877886 | 13102412
otn | 600 | 2 | 20 | 20 | 60 | 200 | 45900067 | 0 | 0 | 143406 | 12 | 850558 | 993976 | 119999291 | 79088251 | 20639580
otn | 600 | 4 | 20 | 20 | 60 | 200 | 86255751 | 2 | 0 | 1097069 | 217 | 6418389 | 7515677 | 240004343 | 163472823 | 38399616
otn | 600 | 8 | 20 | 20 | 60 | 200 | 87708062 | 1 | 0 | 876467 | 213 | 5278339 | 6155020 | 479983305 | 401847316 | 39139504
otn | 600 | 1 | 5 | 5 | 90 | 20000 | 28460472 | 0 | 0 | 0 | 0 | 0 | 0 | 60002796 | 34691724 | 12659866
otn | 600 | 2 | 5 | 5 | 90 | 20000 | 48469284 | 0 | 0 | 1543 | 0 | 89000 | 90543 | 120001019 | 76913629 | 21550714
otn | 600 | 4 | 5 | 5 | 90 | 20000 | 97891433 | 0 | 0 | 9235 | 0 | 524203 | 533438 | 240000150 | 152908352 | 43567632
otn | 600 | 8 | 5 | 5 | 90 | 20000 | 97898044 | 0 | 0 | 9659 | 0 | 522824 | 532483 | 479993752 | 358332979 | 62580997
otn | 600 | 1 | 20 | 20 | 60 | 20000 | 27485237 | 0 | 0 | 0 | 0 | 0 | 0 | 60001540 | 35513163 | 12296030
otn | 600 | 2 | 20 | 20 | 60 | 20000 | 43996572 | 0 | 0 | 21029 | 0 | 363341 | 384370 | 120007518 | 80911070 | 19602067
otn | 600 | 4 | 20 | 20 | 60 | 20000 | 86495636 | 0 | 0 | 125806 | 1 | 2144293 | 2270100 | 239994872 | 163162440 | 38517440
otn | 600 | 8 | 20 | 20 | 60 | 20000 | 86114382 | 0 | 0 | 125903 | 2 | 2131478 | 2257383 | 479935797 | 395339281 | 41888548
otp | 600 | 1 | 5 | 5 | 90 | 200 | 29647554 | 0 | 0 | 0 | 0 | 0 | 0 | 60005483 | 33879956 | 13159120
otp | 600 | 2 | 5 | 5 | 90 | 200 | 54524261 | 0 | 280894 | 3472 | 0 | 12155 | 296521 | 119995524 | 71924680 | 24207249
otp | 600 | 4 | 5 | 5 | 90 | 200 | 107502028 | 0 | 2155823 | 28585 | 0 | 85651 | 2270059 | 239970272 | 145266367 | 47686291
otp | 600 | 8 | 5 | 5 | 90 | 200 | 106448876 | 0 | 1645253 | 21671 | 0 | 72069 | 1738993 | 479999967 | 385573168 | 47566570
otp | 600 | 1 | 20 | 20 | 60 | 200 | 28250796 | 0 | 0 | 0 | 0 | 0 | 0 | 60004269 | 35096021 | 12547599
otp | 600 | 2 | 20 | 20 | 60 | 200 | 47191176 | 0 | 899642 | 43699 | 0 | 152163 | 1095504 | 119999873 | 78363202 | 20969229
otp | 600 | 4 | 20 | 20 | 60 | 200 | 89821587 | 0 | 5478588 | 282503 | 0 | 901475 | 6662566 | 239998454 | 160800615 | 39847400
otp | 600 | 8 | 20 | 20 | 60 | 200 | 89545709 | 0 | 5436856 | 282495 | 0 | 908046 | 6627397 | 479950282 | 400898458 | 39778411
otp | 600 | 1 | 5 | 5 | 90 | 20000 | 26413730 | 0 | 0 | 0 | 0 | 0 | 0 | 60000262 | 36723138 | 11723697
otp | 600 | 2 | 5 | 5 | 90 | 20000 | 47717158 | 0 | 57357 | 1314 | 0 | 9049 | 67720 | 120008115 | 77778641 | 21304716
otp | 600 | 4 | 5 | 5 | 90 | 20000 | 98071455 | 0 | 348993 | 7942 | 0 | 56185 | 413120 | 239987750 | 153584176 | 43499070
otp | 600 | 8 | 5 | 5 | 90 | 20000 | 98128871 | 0 | 355178 | 8041 | 0 | 56085 | 419304 | 479878854 | 373436710 | 52434200
otp | 600 | 1 | 20 | 20 | 60 | 20000 | 25094758 | 0 | 0 | 0 | 0 | 0 | 0 | 60006179 | 37874400 | 11135828
otp | 600 | 2 | 20 | 20 | 60 | 20000 | 43066005 | 0 | 186565 | 17203 | 0 | 122131 | 325899 | 120012672 | 81965389 | 19139621
otp | 600 | 4 | 20 | 20 | 60 | 20000 | 84610761 | 0 | 1160090 | 100240 | 0 | 690561 | 1950891 | 239995128 | 165360149 | 37567538
otp | 600 | 8 | 20 | 20 | 60 | 20000 | 85310195 | 0 | 1177235 | 99832 | 0 | 696301 | 1973368 | 479850719 | 399312123 | 40350950
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
otf | 600 | 1 | 5 | 5 | 90 | 200 | 27360720 | 0 | 0 | 0 | 0 | 0 | 0 | 60001701 | 35807311 | 12172364
otf | 600 | 2 | 5 | 5 | 90 | 200 | 50451830 | 0 | 333827 | 3542 | 0 | 8718 | 346087 | 120000437 | 75356827 | 22449211
otf | 600 | 4 | 5 | 5 | 90 | 200 | 99758846 | 0 | 1995046 | 22900 | 0 | 53510 | 2071456 | 239999002 | 151815074 | 44347026
otf | 600 | 8 | 5 | 5 | 90 | 200 | 99453191 | 0 | 1926290 | 22223 | 0 | 52974 | 2001487 | 480000348 | 391775796 | 44385616
otf | 600 | 1 | 20 | 20 | 60 | 200 | 25694091 | 0 | 0 | 0 | 0 | 0 | 0 | 60003004 | 37142505 | 11493192
otf | 600 | 2 | 20 | 20 | 60 | 200 | 43178644 | 0 | 1098179 | 43905 | 0 | 109455 | 1251539 | 120000556 | 81653918 | 19235689
otf | 600 | 4 | 20 | 20 | 60 | 200 | 81555248 | 0 | 6540437 | 285189 | 0 | 652087 | 7477713 | 240000333 | 167454956 | 36477379
otf | 600 | 8 | 20 | 20 | 60 | 200 | 81048273 | 0 | 6366333 | 280675 | 0 | 651726 | 7298734 | 479987259 | 408025538 | 36078283
otf | 600 | 1 | 5 | 5 | 90 | 20000 | 23134932 | 0 | 0 | 0 | 0 | 0 | 0 | 60002731 | 39561090 | 10279719
otf | 600 | 2 | 5 | 5 | 90 | 20000 | 43353790 | 0 | 59668 | 1416 | 0 | 5631 | 66715 | 119999993 | 81653515 | 19287147
otf | 600 | 4 | 5 | 5 | 90 | 20000 | 86283134 | 0 | 361731 | 8593 | 0 | 32735 | 403059 | 239995820 | 163739569 | 38346744
otf | 600 | 8 | 5 | 5 | 90 | 20000 | 86570810 | 0 | 373579 | 9314 | 0 | 33184 | 416077 | 479912448 | 389287534 | 45614813
otf | 600 | 1 | 20 | 20 | 60 | 20000 | 21756906 | 0 | 0 | 0 | 0 | 0 | 0 | 60002035 | 40711850 | 9671624
otf | 600 | 2 | 20 | 20 | 60 | 20000 | 38016082 | 0 | 228681 | 18569 | 0 | 72903 | 320153 | 120007608 | 86296041 | 16916453
otf | 600 | 4 | 20 | 20 | 60 | 20000 | 74251379 | 0 | 1419052 | 110646 | 0 | 416519 | 1946217 | 239988670 | 174165160 | 33006592
otf | 600 | 8 | 20 | 20 | 60 | 20000 | 74742195 | 0 | 1375190 | 109155 | 0 | 414687 | 1899032 | 479942476 | 410324976 | 35063876
oan | 600 | 1 | 5 | 5 | 90 | 200 | 31527072 | 0 | 0 | 0 | 0 | 0 | 0 | 60002427 | 32214419 | 13997669
oan | 600 | 2 | 5 | 5 | 90 | 200 | 53243352 | 0 | 0 | 8974 | 0 | 222955 | 231929 | 119995821 | 73040602 | 23653721
oan | 600 | 4 | 5 | 5 | 90 | 200 | 107480395 | 0 | 0 | 53658 | 11 | 1372472 | 1426141 | 240008164 | 145279607 | 47708370
oan | 600 | 8 | 5 | 5 | 90 | 200 | 106551617 | 0 | 0 | 52936 | 7 | 1347975 | 1400918 | 479990227 | 384549675 | 48195213
oan | 600 | 1 | 20 | 20 | 60 | 200 | 30330816 | 0 | 0 | 0 | 0 | 0 | 0 | 60003087 | 33270584 | 13459358
oan | 600 | 2 | 20 | 20 | 60 | 200 | 48397736 | 0 | 0 | 123178 | 5 | 756618 | 879801 | 120002327 | 77320838 | 21486923
oan | 600 | 4 | 20 | 20 | 60 | 200 | 94093308 | 0 | 0 | 760665 | 157 | 4809643 | 5570465 | 240001081 | 156841271 | 41986147
oan | 600 | 8 | 20 | 20 | 60 | 200 | 93506448 | 0 | 0 | 743093 | 176 | 4751042 | 5494311 | 480035277 | 397442010 | 41598755
oan | 600 | 1 | 5 | 5 | 90 | 20000 | 29875607 | 0 | 0 | 0 | 0 | 0 | 0 | 60004990 | 33623095 | 13352052
oan | 600 | 2 | 5 | 5 | 90 | 20000 | 50864486 | 0 | 0 | 1477 | 0 | 87037 | 88514 | 119993488 | 75164028 | 22573315
oan | 600 | 4 | 5 | 5 | 90 | 20000 | 102722985 | 0 | 0 | 8365 | 0 | 481990 | 490355 | 240018013 | 149571657 | 45588764
oan | 600 | 8 | 5 | 5 | 90 | 20000 | 102796128 | 0 | 0 | 8501 | 0 | 490920 | 499421 | 479870277 | 341989627 | 71787419
oan | 600 | 1 | 20 | 20 | 60 | 20000 | 28677186 | 0 | 0 | 0 | 0 | 0 | 0 | 60001584 | 34727166 | 12718123
oan | 600 | 2 | 20 | 20 | 60 | 20000 | 46737624 | 0 | 0 | 20194 | 0 | 344395 | 364589 | 120001830 | 78780651 | 20760581
oan | 600 | 4 | 20 | 20 | 60 | 20000 | 91880601 | 0 | 0 | 113180 | 0 | 1984196 | 2097376 | 240009110 | 159039993 | 40786886
oan | 600 | 8 | 20 | 20 | 60 | 20000 | 92950550 | 0 | 0 | 116117 | 0 | 1994940 | 2111057 | 479619231 | 384773793 | 47519616
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
oap | 600 | 1 | 5 | 5 | 90 | 200 | 29586622 | 0 | 0 | 0 | 0 | 0 | 0 | 60001372 | 33884369 | 13134939
oap | 600 | 2 | 5 | 5 | 90 | 200 | 54899786 | 0 | 224832 | 2951 | 0 | 10361 | 238144 | 120001553 | 71353074 | 24574274
oap | 600 | 4 | 5 | 5 | 90 | 200 | 109032504 | 0 | 1298338 | 18824 | 0 | 61012 | 1378174 | 240006016 | 143775840 | 48469700
oap | 600 | 8 | 5 | 5 | 90 | 200 | 109092444 | 0 | 1321900 | 18867 | 0 | 62325 | 1403092 | 479910490 | 382403230 | 49362434
oap | 600 | 1 | 20 | 20 | 60 | 200 | 28164182 | 0 | 0 | 0 | 0 | 0 | 0 | 60002467 | 35056823 | 12586029
oap | 600 | 2 | 20 | 20 | 60 | 200 | 48226210 | 0 | 747967 | 38994 | 0 | 135545 | 922506 | 119995241 | 77179505 | 21672166
oap | 600 | 4 | 20 | 20 | 60 | 200 | 93029877 | 0 | 4567540 | 258557 | 0 | 801381 | 5627478 | 240007501 | 157896213 | 41354869
oap | 600 | 8 | 20 | 20 | 60 | 200 | 91619941 | 0 | 4743343 | 253528 | 0 | 813452 | 5810323 | 480014925 | 399022760 | 40806897
oap | 600 | 1 | 5 | 5 | 90 | 20000 | 26464313 | 0 | 0 | 0 | 0 | 0 | 0 | 60000194 | 36631762 | 11777543
oap | 600 | 2 | 5 | 5 | 90 | 20000 | 49717696 | 0 | 50402 | 1146 | 0 | 8389 | 59937 | 120007914 | 75948260 | 22250548
oap | 600 | 4 | 5 | 5 | 90 | 20000 | 99516265 | 0 | 283135 | 7191 | 0 | 48952 | 339278 | 240022596 | 152184224 | 44239784
oap | 600 | 8 | 5 | 5 | 90 | 20000 | 99500357 | 0 | 290860 | 6964 | 0 | 49218 | 347042 | 479898971 | 358759177 | 60099974
oap | 600 | 1 | 20 | 20 | 60 | 20000 | 25192743 | 0 | 0 | 0 | 0 | 0 | 0 | 60001746 | 37772095 | 11195341
oap | 600 | 2 | 20 | 20 | 60 | 20000 | 44522668 | 0 | 167818 | 15923 | 0 | 114924 | 298665 | 119999098 | 80680699 | 19804006
oap | 600 | 4 | 20 | 20 | 60 | 20000 | 87192742 | 0 | 1008067 | 91573 | 0 | 627140 | 1726780 | 240033004 | 162951564 | 38869850
oap | 600 | 8 | 20 | 20 | 60 | 20000 | 87622780 | 0 | 997565 | 93874 | 0 | 637860 | 1729299 | 479763543 | 389636223 | 44156038
oaf | 600 | 1 | 5 | 5 | 90 | 200 | 27510585 | 0 | 0 | 0 | 0 | 0 | 0 | 60005917 | 35713344 | 12218835
oaf | 600 | 2 | 5 | 5 | 90 | 200 | 51199718 | 0 | 278832 | 3252 | 0 | 7930 | 290014 | 119998860 | 74775102 | 22753206
oaf | 600 | 4 | 5 | 5 | 90 | 200 | 101375701 | 0 | 1634857 | 20245 | 0 | 47082 | 1702184 | 239986892 | 150529689 | 44978428
oaf | 600 | 8 | 5 | 5 | 90 | 200 | 101550307 | 0 | 1652774 | 20221 | 0 | 47136 | 1720131 | 479980194 | 389962935 | 45310473
oaf | 600 | 1 | 20 | 20 | 60 | 200 | 25809312 | 0 | 0 | 0 | 0 | 0 | 0 | 60003994 | 37044237 | 11542933
oaf | 600 | 2 | 20 | 20 | 60 | 200 | 44548544 | 0 | 967341 | 40994 | 0 | 101706 | 1110041 | 119999212 | 80484476 | 19812047
oaf | 600 | 4 | 20 | 20 | 60 | 200 | 84779472 | 0 | 5749469 | 264566 | 0 | 598396 | 6612431 | 239997378 | 164821871 | 37681929
oaf | 600 | 8 | 20 | 20 | 60 | 200 | 84638035 | 0 | 5741557 | 262842 | 0 | 598606 | 6603005 | 479996201 | 404877236 | 37643197
oaf | 600 | 1 | 5 | 5 | 90 | 20000 | 23242525 | 0 | 0 | 0 | 0 | 0 | 0 | 60004003 | 39488492 | 10319107
oaf | 600 | 2 | 5 | 5 | 90 | 20000 | 44316350 | 0 | 52644 | 1386 | 0 | 5381 | 59411 | 120008541 | 80866873 | 19690568
oaf | 600 | 4 | 5 | 5 | 90 | 20000 | 87551921 | 0 | 303010 | 7713 | 0 | 29812 | 340535 | 239998067 | 162485126 | 39100866
oaf | 600 | 8 | 5 | 5 | 90 | 20000 | 87948449 | 0 | 311170 | 7770 | 0 | 30089 | 349029 | 479858942 | 382666537 | 48710828
oaf | 600 | 1 | 20 | 20 | 60 | 20000 | 21927175 | 0 | 0 | 0 | 0 | 0 | 0 | 60001981 | 40567015 | 9740824
oaf | 600 | 2 | 20 | 20 | 60 | 20000 | 38915085 | 0 | 212462 | 17371 | 0 | 68761 | 298594 | 119999024 | 85414269 | 17392838
oaf | 600 | 4 | 20 | 20 | 60 | 20000 | 77582700 | 0 | 1235708 | 104794 | 0 | 390514 | 1731016 | 239993659 | 171041807 | 34680403
oaf | 600 | 8 | 20 | 20 | 60 | 20000 | 77403976 | 0 | 1239388 | 103498 | 0 | 389152 | 1732038 | 480012261 | 405691697 | 36780245
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
TL2 | 600 | 1 | 5 | 5 | 90 | 200 | 25359092 | 0 | 0 | 0 | 0 | 0 | 0 | 60003603 | 37716995 | 11213285
TL2 | 600 | 2 | 5 | 5 | 90 | 200 | 47164465 | 0 | 0 | 0 | 0 | 0 | 0 | 120003086 | 78550168 | 20877901
TL2 | 600 | 4 | 5 | 5 | 90 | 200 | 93194306 | 0 | 0 | 0 | 0 | 0 | 0 | 239989569 | 157837859 | 41518250
TL2 | 600 | 8 | 5 | 5 | 90 | 200 | 85463351 | 0 | 0 | 0 | 0 | 0 | 0 | 479932512 | 331949552 | 71641684
TL2 | 600 | 1 | 20 | 20 | 60 | 200 | 22671725 | 0 | 0 | 0 | 0 | 0 | 0 | 60000280 | 40096105 | 10031603
TL2 | 600 | 2 | 20 | 20 | 60 | 200 | 39342457 | 0 | 0 | 0 | 0 | 0 | 0 | 120008486 | 85399794 | 17417307
TL2 | 600 | 4 | 20 | 20 | 60 | 200 | 75872161 | 0 | 0 | 0 | 0 | 0 | 0 | 240000774 | 173183315 | 33614453
TL2 | 600 | 8 | 20 | 20 | 60 | 200 | 56848863 | 0 | 0 | 0 | 0 | 0 | 0 | 479882753 | 375897236 | 50907048
TL2 | 600 | 1 | 5 | 5 | 90 | 20000 | 20147984 | 0 | 0 | 0 | 0 | 0 | 0 | 60004266 | 42288251 | 8918220
TL2 | 600 | 2 | 5 | 5 | 90 | 20000 | 38193302 | 0 | 0 | 0 | 0 | 0 | 0 | 120007634 | 86417684 | 16941922
TL2 | 600 | 4 | 5 | 5 | 90 | 20000 | 76088290 | 0 | 0 | 0 | 0 | 0 | 0 | 239984289 | 173175517 | 33654747
TL2 | 600 | 8 | 5 | 5 | 90 | 20000 | 71099625 | 0 | 0 | 0 | 0 | 0 | 0 | 479832114 | 354715320 | 63226053
TL2 | 600 | 1 | 20 | 20 | 60 | 20000 | 18141997 | 0 | 0 | 0 | 0 | 0 | 0 | 60006470 | 44055408 | 8022363
TL2 | 600 | 2 | 20 | 20 | 60 | 20000 | 32640660 | 0 | 0 | 0 | 0 | 0 | 0 | 120005334 | 91305214 | 14458453
TL2 | 600 | 4 | 20 | 20 | 60 | 20000 | 64396603 | 0 | 0 | 0 | 0 | 0 | 0 | 240015277 | 183236623 | 28711959
TL2 | 600 | 8 | 20 | 20 | 60 | 20000 | 53347455 | 0 | 0 | 0 | 0 | 0 | 0 | 480620204 | 388924968 | 48453630
Ennals | 600 | 1 | 5 | 5 | 90 | 200 | 31008235 | 0 | 0 | 0 | 0 | 0 | 0 | 60003500 | 32586228 | 13827009
Ennals | 600 | 2 | 5 | 5 | 90 | 200 | 57948473 | 0 | 0 | 0 | 0 | 0 | 0 | 120004877 | 68670332 | 25933676
Ennals | 600 | 4 | 5 | 5 | 90 | 200 | 113866795 | 0 | 0 | 0 | 0 | 0 | 0 | 239986541 | 139017285 | 51015516
Ennals | 600 | 8 | 5 | 5 | 90 | 200 | 106058556 | 0 | 0 | 0 | 0 | 0 | 0 | 479907810 | 293451603 | 95416648
Ennals | 600 | 1 | 20 | 20 | 60 | 200 | 28975153 | 0 | 0 | 0 | 0 | 0 | 0 | 60002822 | 34282863 | 13000602
Ennals | 600 | 2 | 20 | 20 | 60 | 200 | 51636438 | 0 | 0 | 0 | 0 | 0 | 0 | 119999027 | 74269141 | 23068011
Ennals | 600 | 4 | 20 | 20 | 60 | 200 | 99847006 | 0 | 0 | 0 | 0 | 0 | 0 | 239987979 | 151284522 | 44789478
Ennals | 600 | 8 | 20 | 20 | 60 | 200 | 72726881 | 0 | 0 | 0 | 0 | 0 | 0 | 479516492 | 353760727 | 64103439
Ennals | 600 | 1 | 5 | 5 | 90 | 20000 | 28337332 | 0 | 0 | 0 | 0 | 0 | 0 | 60006615 | 34953922 | 12643192
Ennals | 600 | 2 | 5 | 5 | 90 | 20000 | 53319991 | 0 | 0 | 0 | 0 | 0 | 0 | 120010041 | 72827197 | 23824109
Ennals | 600 | 4 | 5 | 5 | 90 | 20000 | 106750426 | 0 | 0 | 0 | 0 | 0 | 0 | 239999475 | 145564861 | 47678344
Ennals | 600 | 8 | 5 | 5 | 90 | 20000 | 100814261 | 0 | 0 | 0 | 0 | 0 | 0 | 480031716 | 305503277 | 88296304
Ennals | 600 | 1 | 20 | 20 | 60 | 20000 | 26843927 | 0 | 0 | 0 | 0 | 0 | 0 | 60003685 | 36257892 | 11983916
Ennals | 600 | 2 | 20 | 20 | 60 | 20000 | 49312823 | 0 | 0 | 0 | 0 | 0 | 0 | 119999804 | 76352967 | 22013803
Ennals | 600 | 4 | 20 | 20 | 60 | 20000 | 96110399 | 0 | 0 | 0 | 0 | 0 | 0 | 240002013 | 154839024 | 42972613
Ennals | 600 | 8 | 20 | 20 | 60 | 20000 | 76087596 | 0 | 0 | 0 | 0 | 0 | 0 | 479910403 | 346962989 | 70685694
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
Void | 600 | 1 | 5 | 5 | 90 | 200 | 43378565 | 0 | 0 | 0 | 0 | 0 | 0 | 60000996 | 21742446 | 19289642
Void | 600 | 1 | 20 | 20 | 60 | 200 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60000682 | 22524814 | 18887228
Void | 600 | 1 | 5 | 5 | 90 | 20000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 60001358 | 23865634 | 18222003
Void | 600 | 1 | 20 | 20 | 60 | 20000 | 40202564 | 0 | 0 | 0 | 0 | 0 | 0 | 60000558 | 24529991 | 17880562
A.3 Raw Data of the Sorted List Tests
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
wtr | 600 | 1 | 5 | 5 | 90 | 200 | 13810886 | 0 | 0 | 0 | 0 | 0 | 0 | 60002659 | 47782939 | 6166247
wtr | 600 | 2 | 5 | 5 | 90 | 200 | 25748860 | 618161 | 0 | 0 | 0 | 64383 | 682544 | 120007867 | 97194426 | 11446165
wtr | 600 | 4 | 5 | 5 | 90 | 200 | 49031435 | 3930508 | 0 | 0 | 0 | 411619 | 4342127 | 239993210 | 196400069 | 21795163
wtr | 600 | 8 | 5 | 5 | 90 | 200 | 49904779 | 3519957 | 0 | 0 | 0 | 420109 | 3940066 | 479914367 | 435389791 | 22363957
wtr | 600 | 1 | 20 | 20 | 60 | 200 | 12283737 | 0 | 0 | 0 | 0 | 0 | 0 | 60005026 | 49177935 | 5449541
wtr | 600 | 2 | 20 | 20 | 60 | 200 | 19874149 | 1922199 | 0 | 0 | 0 | 864580 | 2786779 | 120006199 | 102264814 | 8849850
wtr | 600 | 4 | 20 | 20 | 60 | 200 | 28531294 | 8841395 | 0 | 0 | 0 | 4202789 | 13044184 | 240001150 | 214463013 | 12718022
wtr | 600 | 8 | 20 | 20 | 60 | 200 | 27815591 | 8340645 | 0 | 0 | 0 | 3922095 | 12262740 | 480006065 | 455109559 | 12396882
wtu | 600 | 1 | 5 | 5 | 90 | 200 | 14294117 | 0 | 0 | 0 | 0 | 0 | 0 | 60001881 | 47393485 | 6347299
wtu | 600 | 2 | 5 | 5 | 90 | 200 | 27728651 | 584654 | 0 | 65 | 0 | 58854 | 643573 | 120007465 | 95466102 | 12329539
wtu | 600 | 4 | 5 | 5 | 90 | 200 | 52872657 | 3478558 | 0 | 506 | 0 | 355893 | 3834957 | 239999100 | 193058352 | 23540465
wtu | 600 | 8 | 5 | 5 | 90 | 200 | 52769253 | 3466576 | 0 | 535 | 0 | 357943 | 3825054 | 479957643 | 432708421 | 23642308
wtu | 600 | 1 | 20 | 20 | 60 | 200 | 13032169 | 0 | 0 | 0 | 0 | 0 | 0 | 60005204 | 48500118 | 5785051
wtu | 600 | 2 | 20 | 20 | 60 | 200 | 21542610 | 1954409 | 0 | 1018 | 0 | 778338 | 2733765 | 119999651 | 100762719 | 9606922
wtu | 600 | 4 | 20 | 20 | 60 | 200 | 29022253 | 11446449 | 0 | 25230 | 0 | 3232825 | 14704504 | 240002354 | 213981072 | 12957682
wtu | 600 | 8 | 20 | 20 | 60 | 200 | 29286073 | 10528199 | 0 | 17185 | 0 | 3060749 | 13606133 | 479914829 | 453616076 | 13111325
otn | 600 | 1 | 5 | 5 | 90 | 200 | 26670253 | 0 | 0 | 0 | 0 | 0 | 0 | 60000590 | 36491591 | 11846630
otn | 600 | 2 | 5 | 5 | 90 | 200 | 45024273 | 0 | 0 | 73946 | 6 | 829510 | 903462 | 120004951 | 79987994 | 20154992
otn | 600 | 4 | 5 | 5 | 90 | 200 | 85308910 | 0 | 0 | 485064 | 38 | 5281140 | 5766242 | 239996683 | 164339103 | 37909561
otn | 600 | 8 | 5 | 5 | 90 | 200 | 86533271 | 0 | 0 | 486605 | 13 | 5277908 | 5764526 | 479964710 | 402095930 | 39111010
otn | 600 | 1 | 20 | 20 | 60 | 200 | 25573609 | 0 | 0 | 0 | 0 | 0 | 0 | 60005855 | 37410923 | 11414448
otn | 600 | 2 | 20 | 20 | 60 | 200 | 36763707 | 0 | 0 | 1050246 | 21 | 2436827 | 3487094 | 120009825 | 86921275 | 16547934
otn | 600 | 4 | 20 | 20 | 60 | 200 | 64086868 | 0 | 0 | 6857094 | 308 | 13535579 | 20392981 | 239998409 | 182359703 | 28788142
otn | 600 | 8 | 20 | 20 | 60 | 200 | 63736235 | 0 | 0 | 6879211 | 298 | 13473371 | 20352880 | 480025632 | 422633643 | 28619099
otp | 600 | 1 | 5 | 5 | 90 | 200 | 18071565 | 0 | 0 | 0 | 0 | 0 | 0 | 60001324 | 44079895 | 8019887
otp | 600 | 2 | 5 | 5 | 90 | 200 | 33727703 | 0 | 650557 | 572 | 0 | 42053 | 693182 | 120001786 | 90108928 | 15039930
otp | 600 | 4 | 5 | 5 | 90 | 200 | 65469173 | 0 | 3916718 | 3978 | 0 | 264843 | 4185539 | 240001078 | 181762335 | 29281185
otp | 600 | 8 | 5 | 5 | 90 | 200 | 64944439 | 0 | 3885725 | 3943 | 0 | 261943 | 4151611 | 479968211 | 420688968 | 29709064
otp | 600 | 1 | 20 | 20 | 60 | 200 | 17529044 | 0 | 0 | 0 | 0 | 0 | 0 | 60000628 | 44556384 | 7777062
otp | 600 | 2 | 20 | 20 | 60 | 200 | 28869976 | 0 | 2250311 | 8649 | 0 | 657323 | 2916283 | 120003135 | 94218440 | 12893523
otp | 600 | 4 | 20 | 20 | 60 | 200 | 46443691 | 0 | 13456278 | 65637 | 0 | 3455905 | 16977820 | 240018298 | 198236930 | 20813163
otp | 600 | 8 | 20 | 20 | 60 | 200 | 45974182 | 0 | 13081137 | 60524 | 0 | 3385059 | 16526720 | 480011850 | 438644298 | 20593690
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
otf | 600 | 1 | 5 | 5 | 90 | 200 | 13629838 | 0 | 0 | 0 | 0 | 0 | 0 | 60003774 | 47970585 | 6050798
otf | 600 | 2 | 5 | 5 | 90 | 200 | 25338497 | 0 | 543550 | 464 | 0 | 45334 | 589348 | 120007327 | 97549041 | 11266776
otf | 600 | 4 | 5 | 5 | 90 | 200 | 48722470 | 0 | 3290272 | 3048 | 0 | 270087 | 3563407 | 239988245 | 196734633 | 21663904
otf | 600 | 8 | 5 | 5 | 90 | 200 | 49105780 | 0 | 3318712 | 3140 | 0 | 271500 | 3593352 | 479944073 | 435925724 | 22027238
otf | 600 | 1 | 20 | 20 | 60 | 200 | 12763247 | 0 | 0 | 0 | 0 | 0 | 0 | 60003918 | 48681836 | 5664183
otf | 600 | 2 | 20 | 20 | 60 | 200 | 20915274 | 0 | 1915708 | 6608 | 0 | 644994 | 2567310 | 119996330 | 101243347 | 9315915
otf | 600 | 4 | 20 | 20 | 60 | 200 | 31118194 | 0 | 10528717 | 37741 | 0 | 2797421 | 13363879 | 240000388 | 211938964 | 13900120
otf | 600 | 8 | 20 | 20 | 60 | 200 | 30686122 | 0 | 10208943 | 36052 | 0 | 2735707 | 12980702 | 480004796 | 452347886 | 13708508
oan | 600 | 1 | 5 | 5 | 90 | 200 | 26715884 | 0 | 0 | 0 | 0 | 0 | 0 | 60002477 | 36393865 | 11922833
oan | 600 | 2 | 5 | 5 | 90 | 200 | 45886476 | 0 | 0 | 63911 | 727 | 723321 | 787959 | 120008955 | 79394426 | 20418075
oan | 600 | 4 | 5 | 5 | 90 | 200 | 88372153 | 0 | 0 | 435566 | 3410 | 4749083 | 5188059 | 240015152 | 161377348 | 39611163
oan | 600 | 8 | 5 | 5 | 90 | 200 | 88377971 | 0 | 0 | 428698 | 3341 | 4719020 | 5151059 | 479947158 | 399039849 | 40850613
oan | 600 | 1 | 20 | 20 | 60 | 200 | 25751366 | 0 | 0 | 0 | 0 | 0 | 0 | 60002170 | 37310035 | 11430759
oan | 600 | 2 | 20 | 20 | 60 | 200 | 39940825 | 0 | 0 | 844776 | 1879 | 1970141 | 2816796 | 119999471 | 84334952 | 17852313
oan | 600 | 4 | 20 | 20 | 60 | 200 | 67821181 | 0 | 0 | 6323543 | 11447 | 12838303 | 19173293 | 239998093 | 178864428 | 30584331
oan | 600 | 8 | 20 | 20 | 60 | 200 | 66146450 | 0 | 0 | 6180378 | 12102 | 12534310 | 18726790 | 479973935 | 420380263 | 29740826
oap | 600 | 1 | 5 | 5 | 90 | 200 | 18033475 | 0 | 0 | 0 | 0 | 0 | 0 | 59999939 | 44009050 | 8041850
oap | 600 | 2 | 5 | 5 | 90 | 200 | 33897622 | 0 | 595006 | 618 | 0 | 39367 | 634991 | 120006567 | 89816335 | 15156705
oap | 600 | 4 | 5 | 5 | 90 | 200 | 63666720 | 0 | 3548704 | 3850 | 0 | 241440 | 3793994 | 239991553 | 183103416 | 28509333
oap | 600 | 8 | 5 | 5 | 90 | 200 | 65397090 | 0 | 3598651 | 3869 | 0 | 246737 | 3849257 | 479943422 | 420020545 | 30181002
oap | 600 | 1 | 20 | 20 | 60 | 200 | 17455461 | 0 | 0 | 0 | 0 | 0 | 0 | 60003308 | 44447899 | 7786350
oap | 600 | 2 | 20 | 20 | 60 | 200 | 29509231 | 0 | 2124585 | 8595 | 0 | 634096 | 2767276 | 120007970 | 93358943 | 13248519
oap | 600 | 4 | 20 | 20 | 60 | 200 | 48074349 | 0 | 12985913 | 65222 | 0 | 3381326 | 16432461 | 239979812 | 196102554 | 21806558
oap | 600 | 8 | 20 | 20 | 60 | 200 | 47185168 | 0 | 12833747 | 64013 | 0 | 3374974 | 16272734 | 479999058 | 437069839 | 21266684
oaf | 600 | 1 | 5 | 5 | 90 | 200 | 13768663 | 0 | 0 | 0 | 0 | 0 | 0 | 60006251 | 47869276 | 6110934
oaf | 600 | 2 | 5 | 5 | 90 | 200 | 25933939 | 0 | 525382 | 524 | 0 | 43813 | 569719 | 120004503 | 97063159 | 11527263
oaf | 600 | 4 | 5 | 5 | 90 | 200 | 49767630 | 0 | 3183349 | 3290 | 0 | 268937 | 3455576 | 239996176 | 195876114 | 22137021
oaf | 600 | 8 | 5 | 5 | 90 | 200 | 49969244 | 0 | 3221172 | 3223 | 0 | 266908 | 3491303 | 479817569 | 434954288 | 22473860
oaf | 600 | 1 | 20 | 20 | 60 | 200 | 12787570 | 0 | 0 | 0 | 0 | 0 | 0 | 60001562 | 48744689 | 5660192
oaf | 600 | 2 | 20 | 20 | 60 | 200 | 21609673 | 0 | 1889559 | 6962 | 0 | 639355 | 2535876 | 120000540 | 100749215 | 9620671
oaf | 600 | 4 | 20 | 20 | 60 | 200 | 32617433 | 0 | 10603270 | 40354 | 0 | 2854341 | 13497965 | 240002947 | 210721523 | 14606071
oaf | 600 | 8 | 20 | 20 | 60 | 200 | 32219843 | 0 | 10233040 | 38979 | 0 | 2786966 | 13058985 | 479952200 | 450998171 | 14463122
cmd | duration | nthr | pput | pdel | pget | krange | total | ld aborts | vfy aborts | st aborts | segf aborts | cmt aborts | total aborts | total time | stm time | harn time
Void | 600 | 1 | 5 | 5 | 90 | 200 | 39888311 | 0 | 0 | 0 | 0 | 0 | 0 | 60002543 | 24757846 | 17802303
Void | 600 | 1 | 20 | 20 | 60 | 200 | 39274849 | 0 | 0 | 0 | 0 | 0 | 0 | 60004842 | 25315241 | 17471426
Appendix B
STM Engine API
This Appendix shows the API of the prototype.
B.1 STM engine data structures
Thread    Contains the thread and transaction state information.
B.2 API for transaction initiation and control
Thread * TxNewThread ();
Creates, initializes and returns the Thread data structure. Must be
called by every thread before starting any transaction.
void TxStart (Thread * const self, int roflag);
Starts a transaction. If roflag is 0 the transaction is read-only and
has a smaller overhead. Otherwise the transaction is read-write.
int TxValid (Thread * const self);
Scans the read set and returns 1 if it is in an up-to-date, consistent
state; returns 0 otherwise.
int TxValidateAndAbort (Thread * const self);
Scans the read set and returns 1 if it is in an up-to-date, consistent
state; otherwise aborts and retries the transaction.
int TxCommit (Thread * const self);
Commits the transaction. Always returns 1. If the transaction cannot
commit, it is restarted.
void TxAbort (Thread * const self, int retry);
Aborts the transaction. If retry is set to 1, the transaction is retried.
void TxSterilize (Thread * const self, void volatile * base, size_t const length);
Quiesces the variable with address base and size length. The size
is given in memory words. This function cannot be used inside
a transaction.
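The calls above suggest a simple per-thread lifecycle: create the Thread structure once, then bracket each transaction with TxStart and TxCommit. The following is a minimal sketch of that lifecycle, not the prototype's code. The Thread fields and the function bodies are stand-in stubs invented here so that the example compiles on its own, and run_one_transaction is a hypothetical helper; only the function names and signatures come from the list above.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in stubs (assumptions): the real engine supplies these functions
 * and its own Thread layout. They only record calls, no STM logic. */
typedef struct Thread { int active; int read_only; int commits; } Thread;

Thread *TxNewThread(void) { return calloc(1, sizeof(Thread)); }

void TxStart(Thread *const self, int roflag) {
    self->active = 1;
    self->read_only = (roflag == 0);  /* per the description above: 0 means read-only */
}

int TxCommit(Thread *const self) {
    self->active = 0;
    self->commits++;
    return 1;                         /* the API states that TxCommit always returns 1 */
}

/* Hypothetical helper: runs one empty read-write transaction and returns
 * the number of commits recorded by the stub, or -1 on failure. */
int run_one_transaction(void) {
    Thread *self = TxNewThread();     /* once per thread, before any transaction */
    if (self == NULL) return -1;
    TxStart(self, 1);                 /* non-zero roflag: read-write transaction */
    /* ... transactional loads and stores would go here ... */
    int ok = TxCommit(self);
    int commits = self->commits;
    free(self);
    return ok == 1 ? commits : -1;
}
```

With the real engine, a conflicting transaction would be restarted inside TxCommit rather than reported to the caller, which is why the sketch does not loop on the return value.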
B.3 API for loading and storing data in word based mode
intptr_t TxLoad (Thread * const self, intptr_t volatile * addr);
Returns the value of the transactional variable with address addr.
The transaction aborts and retries if the variable has changed since
the beginning of this transaction.
void TxStore (Thread * const self, intptr_t volatile * addr, intptr_t value);
Stores the value value on the transactional variable with address
addr. The transaction aborts and retries if the variable has changed
since the beginning of this transaction.
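In word based mode, every access to shared data inside a transaction goes through TxLoad and TxStore. The sketch below shows a read-modify-write of a shared counter under that discipline. It is an illustration only: the function bodies are single-threaded stand-in stubs that never abort, and tx_increment and demo_increment_twice are hypothetical helpers introduced for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in, single-threaded stubs (assumptions): the real engine logs the
 * accesses and may abort and retry; these simply read and write memory. */
typedef struct Thread { int active; } Thread;

void TxStart(Thread *const self, int roflag) { (void)roflag; self->active = 1; }
intptr_t TxLoad(Thread *const self, intptr_t volatile *addr) { (void)self; return *addr; }
void TxStore(Thread *const self, intptr_t volatile *addr, intptr_t value) { (void)self; *addr = value; }
int TxCommit(Thread *const self) { self->active = 0; return 1; }

/* Hypothetical helper: transactionally increments *counter by one and
 * returns the value that was stored. */
intptr_t tx_increment(Thread *const self, intptr_t volatile *counter) {
    TxStart(self, 1);                    /* read-write transaction */
    intptr_t v = TxLoad(self, counter);  /* transactional read */
    TxStore(self, counter, v + 1);       /* transactional write */
    TxCommit(self);
    return v + 1;
}

/* Hypothetical driver: two increments starting from 40. */
intptr_t demo_increment_twice(void) {
    Thread self = {0};
    intptr_t volatile counter = 40;
    tx_increment(&self, &counter);
    tx_increment(&self, &counter);
    return counter;
}
```

Because an abort restarts the transaction, code between TxStart and TxCommit should avoid side effects that cannot be repeated.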
B.4 API for loading and storing data in object based mode
int TxOpenRead (Thread * const self, void volatile * addr);
Opens the transactional object/structure with address addr for read.
Transactions must call this function before reading its contents. The
transaction aborts and retries if the object/structure has changed
since the beginning of this transaction.
int TxOpenWrite (Thread * const self, void volatile * addr, unsigned int size);
Opens the transactional object/structure with address addr and size
size for write. Transactions must call this function before writing.
The transaction aborts and retries if the object/structure has changed
since the beginning of this transaction.
int TxVerifyAddr (Thread * const self, void volatile * addr);
Verifies whether the object/structure with address addr has been changed
by another transaction since the beginning of this transaction, and
aborts/retries if that is the case. Otherwise it returns 1.
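In object based mode, a transaction opens a whole structure before touching its fields, instead of wrapping each word access. The sketch below illustrates that pattern with a transfer between two accounts. Everything other than the function names and signatures is an assumption: the stubs always succeed and return 1, the Account type ignores any object header the prototype may require, and tx_transfer, tx_total and demo_transfer are hypothetical helpers.

```c
#include <assert.h>

/* Stand-in stubs (assumptions): the real engine validates and locks the
 * opened objects and may abort and retry; these always succeed. */
typedef struct Thread { int active; } Thread;

void TxStart(Thread *const self, int roflag) { (void)roflag; self->active = 1; }
int TxOpenRead(Thread *const self, void volatile *addr) { (void)self; (void)addr; return 1; }
int TxOpenWrite(Thread *const self, void volatile *addr, unsigned int size) {
    (void)self; (void)addr; (void)size;
    return 1;
}
int TxCommit(Thread *const self) { self->active = 0; return 1; }

/* Hypothetical transactional object; header fields are omitted. */
typedef struct Account { long balance; } Account;

/* Hypothetical helper: moves amount from one account to the other. */
void tx_transfer(Thread *const self, Account *from, Account *to, long amount) {
    TxStart(self, 1);                                    /* read-write transaction */
    TxOpenWrite(self, from, (unsigned int)sizeof *from); /* open before writing */
    TxOpenWrite(self, to, (unsigned int)sizeof *to);
    from->balance -= amount;
    to->balance += amount;
    TxCommit(self);
}

/* Hypothetical helper: reads both balances in a read-only transaction. */
long tx_total(Thread *const self, Account *a, Account *b) {
    TxStart(self, 0);                                    /* roflag 0: read-only, per B.2 */
    TxOpenRead(self, a);                                 /* open before reading */
    TxOpenRead(self, b);
    long total = a->balance + b->balance;
    TxCommit(self);
    return total;
}

/* Hypothetical driver: returns 1 if the transfer preserved the total. */
int demo_transfer(void) {
    Thread self = {0};
    Account a = {100}, b = {50};
    tx_transfer(&self, &a, &b, 30);
    return a.balance == 70 && b.balance == 80 && tx_total(&self, &a, &b) == 150;
}
```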
Bibliography
[ALS06] Kunal Agrawal, Charles E. Leiserson, and Jim Sukha. Memory models
for open-nested transactions. In MSPC ’06: Proceedings of the 2006 work-
shop on Memory system performance and correctness, pages 70–81, New
York, NY, USA, 2006. ACM Press.
[ATLM+06] Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy,
Bratin Saha, and Tatiana Shpeisman. Compiler and runtime support
for efficient software transactional memory. In PLDI ’06: Proceedings of
the 2006 ACM SIGPLAN conference on Programming language design and
implementation, pages 26–37, New York, NY, USA, 2006. ACM Press.
[CRS06] Joa˜o Cachopo and Anto´nio Rito-Silva. Versioned boxes as the basis for
memory transactions. Sci. Comput. Program., 63(2):172–185, 2006.
[Dat94] C. J. Date. Introduction to Database Systems. Addison-Wesley Longman
Publishing Co., Inc., Boston, MA, USA, 1994.
[DON06] D. Dice, O. Shalev, and N. Shavit. Transactional Locking II. In Proc. of
the 20th International Symposium on Distributed Computing (DISC 2006),
pages 194–208, Stockholm, Sweden, 2006.
[DS06] David Dice and Nir Shavit. What really makes transactions faster? In
Proceedings of the First ACM SIGPLAN Workshop on Languages, Compilers,
and Hardware Support for Transactional Computing. Jun 2006.
[DS07] D. Dice and N. Shavit. Understanding tradeoffs in software
transactional memory. In Proc. of the 2007 International Symposium on Code
Generation and Optimization, 2007.
[Enn06] Robert Ennals. Software transactional memory should not be
obstruction-free. Technical Report IRC-TR-06-052, Intel Research
Cambridge Tech Report, Jan 2006.
[HF] T. Harris and K. Fraser. Concurrent programming without locks.
http://research.microsoft.com/~tharris/drafts/cpwl-submission.pdf.
[HF03] Tim Harris and Keir Fraser. Language support for lightweight
transactions. In OOPSLA ’03: Proceedings of the 18th annual ACM SIGPLAN
conference on Object-oriented programming, systems, languages, and
applications, pages 388–402, New York, NY, USA, 2003. ACM Press.
[HLM03] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free
synchronization: Double-ended queues as an example. In ICDCS ’03:
Proceedings of the 23rd International Conference on Distributed Computing
Systems, page 522, Washington, DC, USA, 2003. IEEE Computer Society.
[HLM06] Maurice Herlihy, Victor Luchangco, and Mark Moir. A flexible
framework for implementing software transactional memory. In OOPSLA
’06: Proceedings of the 21st annual ACM SIGPLAN conference on
Object-oriented programming systems, languages, and applications,
pages 253–262, New York, NY, USA, 2006. ACM Press.
[HLMWNS03] Maurice Herlihy, Victor Luchangco, Mark Moir, and William
N. Scherer III. Software transactional memory for dynamic-sized data
structures. In PODC ’03: Proceedings of the twenty-second annual
symposium on Principles of distributed computing, pages 92–101, New York,
NY, USA, 2003. ACM Press.
[HMPJH05] Tim Harris, Simon Marlow, Simon Peyton-Jones, and Maurice Herlihy.
Composable memory transactions. In PPoPP ’05: Proceedings of the tenth
ACM SIGPLAN symposium on Principles and practice of parallel
programming, pages 48–60, New York, NY, USA, 2005. ACM Press.
[HP96] John L. Hennessy and David A. Patterson. Computer architecture (2nd
ed.): a quantitative approach. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1996.
[HPST06] Tim Harris, Mark Plesko, Avraham Shinnar, and David Tarditi.
Optimizing memory transactions. In PLDI ’06: Proceedings of the 2006 ACM
SIGPLAN conference on Programming language design and implementation,
pages 14–25, New York, NY, USA, 2006. ACM Press.
[int] Intel 64 and IA-32 Architectures Software Developer’s Manual.
http://www.intel.com/products/processor/manuals/index.htm.
[LC07] João Lourenço and Gonçalo Cunha. Testing patterns for software
transactional memory engines. In PADTAD ’07: Proceedings of the 2007 ACM
workshop on Parallel and distributed systems: testing and debugging, pages
36–42, New York, NY, USA, 2007. ACM Press.
[McK05] Paul E. McKenney. Memory ordering in modern microprocessors,
Part I. Linux J., 2005(136):2, 2005.
[MCS91] John M. Mellor-Crummey and Michael L. Scott. Algorithms for
scalable synchronization on shared-memory multiprocessors. ACM Trans.
Comput. Syst., 9(1):21–65, 1991.
[NSS] Dave Dice, Nir Shavit, and Ori Shalev. Transactional Locking II. Slides
from TRANSACT06.
[SATH+06] Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao
Minh, and Benjamin Hertzberg. McRT-STM: a high performance
software transactional memory system for a multi-core runtime. In PPoPP
’06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles
and practice of parallel programming, pages 187–197, New York, NY, USA,
2006. ACM Press.
[sch] Standard C Library Functions - schedctl_init Solaris man page.
[ST95] Nir Shavit and Dan Touitou. Software transactional memory. In PODC
’95: Proceedings of the fourteenth annual ACM symposium on Principles of
distributed computing, pages 204–213, New York, NY, USA, 1995. ACM
Press.
[Sut05] Herb Sutter. The free lunch is over. Dr. Dobb’s Journal, Mar 2005.
[WG] D. Weaver and T. Germond. The SPARC Architecture Manual, Version 9.
PTR Prentice Hall, Englewood Cliffs, New Jersey 07632.
[WNSS05] William N. Scherer III and Michael L. Scott. Advanced contention
management for dynamic software transactional memory. In PODC ’05:
Proceedings of the twenty-fourth annual ACM symposium on Principles of
distributed computing, pages 240–248, New York, NY, USA, 2005. ACM
Press.
