Cȃlin Caşcaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee
The overhead posed by STM may likely overshadow its promise.
TM (transactional memory) 1 is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions to address the problem of programming multicore processors. Its most appealing feature is that most programmers only need to reason locally about shared data accesses, mark the code region to be executed transactionally, and let the underlying system ensure the correct concurrent execution. This model promises to provide the scalability of fine-grain locking, while avoiding common pitfalls of lock composition such as deadlock. In this article we explore the performance of a highly optimized STM and observe that the overall performance of TM is significantly worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.
Different implementations of transactional memory systems make tradeoffs that impact both performance and programmability. Larus and Rajwar 2 present an overview of design tradeoffs for implementations of transactional memory systems. Here are some of the design choices:
• STM (software-only TM) 3, 4, 5, 6, 7, 8, 9 is the focus of this article. While offering flexibility and no hardware cost, it leads to overhead in excess of most users' tolerance.
• HTM (hardware-only TM) 10, 11, 12, 13, 14, 15, 16 suffers from two major impediments: high implementation and verification costs lead to design risks too large to justify on a niche programming model; and hardware capacity constraints lead to significant performance degradation when overflow occurs, and proposals for managing overflows (for example, signatures 17) incur false positives that add complexity to the programming model. Therefore, from an industrial perspective, HTM designs have to provide more benefits for the cost on a more diverse set of workloads (with varying transactional characteristics) for hardware designers to consider implementation. (Reuse of hardware for other purposes can also justify its inclusion, as may be the case for Sun's implementation of Scout Threading in the Rock processor.)
• Hybrid systems 19, 20, 21, 22 are the most likely platform for the eventual adoption of TM by a wide audience, although the exact mix of hardware and software support remains unclear. A special case of the hybrid system is the hardware-accelerated STM. In this scenario, the transactional semantics are provided by STM, and hardware primitives are used only to speed up critical performance bottlenecks in the STM system. Such systems could offer an attractive solution if the cost of hardware primitives is modest and may be further amortized by other uses.
Independent of these implementation decisions, there are transactional semantics issues that break the ideal transactional programming model for which the community had hoped. TM introduces a variety of programming issues that are not present in lock-based mutual exclusion. For example, semantics are muddled by:
• Interaction with nontransactional codes, including access to shared data from outside of a transaction (tolerating weak atomicity) and the use of locks inside a transaction (breaking isolation to make locking operations visible outside transactions).
• Exceptions and serializability: how to handle exceptions and propagate consistent exception information from within a transactional context, and how to guarantee that transactional execution respects a correct ordering of operations.
• Interaction with code that cannot be transactionalized, as a result of either communication with other threads or a requirement barring speculation.
• Livelock, or the system guarantee that all transactions make progress even in the presence of conflicts.
In addition to the intrinsic semantic issues, there are also implementation-specific optimizations motivated by high transactional overheads, such as programmer annotations for excluding private data. Furthermore, the nondeterminism introduced by aborting transactions complicates debugging: transactional code may be executed and aborted on conflicts, which makes it difficult for the programmer to find deterministic paths with repeatable behavior. Both of these dilute the productivity argument for transactions, especially software-only TM implementations.
Given all these issues, we conclude that TM has not yet matured to the point where it presents a compelling value proposition that will trigger its widespread adoption. While TM can be a useful tool in the parallel programmer's portfolio, it is not going to solve the parallel programming dilemma by itself. There is evidence that it helps with building certain concurrent data structures, such as hash tables and binary trees. In addition, there are anecdotal claims that it helps with workloads; however, despite several years of active research and publication in the area, we are disappointed to find no mentions in the research literature of large-scale applications that make use of TM.
SOFTWARE TRANSACTIONAL MEMORY
STM implements all the transactional semantics in software. That includes conflict detection, guaranteeing the consistency of transactional reads, preservation of atomicity and isolation (preventing other threads from observing speculative writes before the transaction succeeds), and conflict resolution (transaction arbitration).
The pseudocode for the main operations executed by a typical STM is illustrated in figure 1. It shows two STM algorithms: one that performs full validation and one that uses a global version number (the additional statements marked with the gv# comment). The advantage of STM for system programmers is that it offers flexibility in implementing different mechanisms and policies for these operations. For end users, the advantage of STM is that it offers an environment to transactionalize (i.e., port to TM) their applications without incurring extra hardware cost or waiting for such hardware to be developed.
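Figure 1 is not reproduced in this text, so the following is only a minimal sketch of what the read, write, and commit operations of a global-version-number STM might look like. It assumes a design with one version and one lock per shared location and with writes buffered until commit; the article does not say whether its STM buffers writes. The names `TVar`, `Transaction`, `Conflict`, and `atomically` are hypothetical and do not come from the article. Python's interpreter also hides the memory-ordering and atomic-instruction work that a native implementation has to do explicitly.

```python
import threading


class Conflict(Exception):
    """Raised when a transaction cannot see a consistent snapshot."""


class TVar:
    """A shared location plus its metadata: a version and an owner lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


_clock = 0                       # global version number (gv#)
_clock_lock = threading.Lock()


def _tick():
    """Atomically advance the global version and return the new value."""
    global _clock
    with _clock_lock:
        _clock += 1
        return _clock


class Transaction:
    def __init__(self):
        self.start = _clock      # gv#: snapshot taken at transaction begin
        self.reads = []          # read set
        self.writes = {}         # write set: TVar -> buffered value

    def read(self, tvar):
        if tvar in self.writes:  # read-after-write: return our own update
            return self.writes[tvar]
        before = tvar.version    # metadata read
        value = tvar.value       # data read
        # gv#: one comparison against the snapshot replaces a walk over
        # the whole read set.
        if tvar.lock.locked() or tvar.version != before or before > self.start:
            raise Conflict()
        self.reads.append(tvar)
        return value

    def write(self, tvar, value):
        self.writes[tvar] = value    # buffered until commit

    def commit(self):
        if not self.writes:          # read-only: every read was already checked
            return
        acquired = []
        try:
            for tvar in self.writes:             # acquire write-set metadata
                if not tvar.lock.acquire(blocking=False):
                    raise Conflict()
                acquired.append(tvar)
            commit_version = _tick()
            for tvar in self.reads:              # validate the read set once
                locked_by_other = tvar.lock.locked() and tvar not in self.writes
                if tvar.version > self.start or locked_by_other:
                    raise Conflict()
            for tvar, value in self.writes.items():   # write back, publish
                tvar.value = value
                tvar.version = commit_version
        finally:
            for tvar in acquired:                # release metadata
                tvar.lock.release()


def atomically(body):
    """Run body(tx) until it commits without a conflict."""
    while True:
        tx = Transaction()
        try:
            result = body(tx)
            tx.commit()
            return result
        except Conflict:
            continue
```

Even in this toy form, a single transactional load turns into a lookup in the write set, two metadata reads, a comparison, and an append to the read set, which is the kind of expansion that the overhead discussion in this article is concerned with.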
On the other hand, STM entails nontrivial drawbacks with respect to performance and programming semantics:
Overheads. In general, STM results in higher sequential overheads than traditional shared-memory programming or HTM. This is the result of the software expansion of loads and stores to shared mutable locations inside transactions to tens of additional instructions that constitute the STM implementation (for example, the STM_READ code in figure 1). Depending on the transactional characteristics of a workload, these overheads can become a high hurdle for STM to achieve performance. The sequential overheads (that is, conflict-free overheads that are incurred regardless of the actions of other concurrent threads) must be overcome by the concurrency-enabling characteristics of transactional memory.
Semantics. To avoid incurring high STM overheads, nontransactional accesses (i.e., loads and stores occurring outside transactions) are typically not expanded. This has the effect of weakening (and hence complicating) the semantics of transactions, which may require the programmer to be more careful than when strong transactional semantics are supported. The following are some of the weakened guarantees that are usually associated with such STMs:
• Weak atomicity. Typically, the STM runtime libraries cannot detect conflicts between transactions and nontransactional accesses. Thus, the semantics of atomicity are weakened to allow undetected conflicts with nontransactional accesses (referred to as weak atomicity 28), or equivalently put the burden on the programmer to guarantee that no such conflicts can possibly take place.
• Privatization. Some STM designs prohibit the seamless privatization of memory locations (that is, the transition from being accessed transactionally to being accessed privately, or nontransactionally in general, by using locks). For some STM designs, once a location is accessed transactionally, it must continue to be accessed that way. Sometimes, the programmer can ease the transition by guaranteeing that the first access to the privatized location (such as after the location is no longer accessible by other threads) is transactional.
• Memory reclamation. Some STM designs prohibit the seamless reclamation of the memory locations accessed transactionally for arbitrary reuse, such as using malloc and free. With such STM designs, memory allocation and deallocation for locations accessed transactionally are handled differently than for other locations.
• Legacy binaries. STM needs to observe all memory activities of the transactional regions to ensure atomicity and isolation. STM designs that achieve this observation by code instrumentation generally cannot support transactions calling legacy codes that are not instrumented (for example, third-party libraries) without seriously limiting concurrency, such as by serializing transactions.
EVALUATION
Throughout this section we use the following set of benchmarks:
• b+tree is an implementation of database indexing operations on a b-tree data structure for which the data is stored only on the tree leaves (a b+ tree). This implementation uses coarse-grained transactions for every tree operation. Each b+ tree operation starts from the tree root and descends to the leaves. A leaf update may trigger a structural modification to rebalance the tree. A rebalancing operation often involves recursive ascent over the child-parent edges. In the worst case, the rebalancing operation modifies the entire tree. Our workload inserts 2,048 items in a b+ tree of order 20. For this code we have only a transactional version that is not manually instrumented; therefore, experimental results are presented only in configurations where we can use our compiler to provide instrumentation.
• delaunay implements the Delaunay Mesh Refinement algorithm described in Kulkarni et al. 29 The code produces a guaranteed quality Delaunay mesh. This is a Delaunay triangulation with the additional constraint that no angle in the mesh be less than 30 degrees. The benchmark takes as input an unrefined Delaunay triangulation and produces a new triangulation that satisfies this constraint.
In the TM implementation of the algorithm, multiple threads choose their elements from a work queue and refine the cavities as separate transactions.
• genome, kmeans, and vacation are part of the STAMP benchmark suite 30 version 0.9.4. For a detailed description of these benchmarks, see STAMP. 31
Baseline performance. Figure 2 presents a performance comparison of three STMs: IBM, 32, 33 Intel, 34 and Sun TL2. 35 The runs are on a quad-core, two-way hyperthreaded Intel Xeon 2.3-GHz box running Linux Fedora Core 6. In these runs, we used the manually instrumented versions of the codes, which aggressively minimize the number of barriers for the IBM and TL2 STMs. Since we do not have access to low-level APIs for the Intel STM, its curves are from codes instrumented by the Intel STM compiler, which incurs additional barrier overheads as a result of compiler instrumentation. 36 The graphs are scalability curves with respect to the serial, nontransactionalized version. Therefore, a value of 1 on the y-axis represents performance equal to the serial version. The performance of these STMs is mostly on par, with the IBM STM showing better scalability on delaunay and TL2 obtaining better scalability on genome. The overall performance obtained is very low, however: on kmeans the IBM STM barely attains single-threaded performance at four threads, while on vacation none of the STMs actually overcomes the overhead of transactional memory even with eight threads.
Compiler instrumentation. The compiler is a necessary component of any STM-based programming environment that is meant to be adopted by programmers at large. Its basic role is to eliminate the need for programmers to manually instrument memory references with STM read and write barriers. While offering convenience, compiler instrumentation adds another layer of overhead to the STM system by introducing redundant barriers, often resulting from the conservativeness of compiler analysis, as observed in Yoo.
We study the performance of two STM algorithms: one that fully validates (fv) the read set after each transactional read and one that uses a global version number (gv#) to avoid the full validation, while maintaining the correctness of the operations. The fv algorithm provides more concurrency at a much higher price. The gv# algorithm is deemed one of the best tradeoffs for STM implementations. Figure 4 presents the single-threaded overhead of these algorithms over sequential runs, illustrating again the substantial slowdowns that the algorithms induce.
Figure 6 gives a fine-grained breakdown of the overheads of the transactional read operation. As expected, the overhead of validating the read set dominates transactional read time in the fv configuration. For both algorithms, the isync operations (necessary for ordering the metadata read and data read, as well as the data read and validation) form a substantial component. In applications that perform writes before reads in the same transaction (delaunay, kmeans), the time spent checking whether a location has been written by prior transactional writes in the same transaction forms a significant component of the total time. Interestingly, reading the data itself is a negligible amount of the total time, indicating the hurdles that must be overcome for the performance of these algorithms to be compelling.
Figure 7 gives a similar breakdown of the transactional commit operation. As before, the fv configuration suffers from having to validate the read set. Other dominant overheads for both configurations are those of having to acquire the metadata for the write set (which involves a sequence of load-linked/store-conditional operations) and the sync operations that are necessary for ordering the metadata acquires, data writes, and metadata releases. Once again, the data writes themselves form a small component of the total time.
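To see why read-set validation dominates in the fv configuration, it can help to count metadata checks. The sketch below is a simplified cost model, not a measurement. It assumes a conflict-free transaction with no concurrent commits, so that the gv# variant never has to fall back to a full validation, and it counts one check per read-set entry examined. The function name `validation_checks` is hypothetical.

```python
def validation_checks(num_reads, full_validation):
    """Count metadata checks performed by the reads of one transaction."""
    checks = 0
    read_set_size = 0
    for _ in range(num_reads):
        read_set_size += 1
        if full_validation:
            checks += read_set_size   # fv: revalidate every entry read so far
        else:
            checks += 1               # gv#: one comparison with the global version
    return checks


fv_checks = validation_checks(100, full_validation=True)    # 100 * 101 / 2 = 5050
gv_checks = validation_checks(100, full_validation=False)   # 100
```

Under these assumptions, a transaction with n reads performs n(n+1)/2 checks with full validation and n checks with a global version number, so the gap widens as transactions grow longer.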
Overhead optimizations. There have been many proposals on reducing STM overheads through compiler or runtime techniques. Most of these techniques are complementary to hardware acceleration for STM.
• Redundant barrier elimination. One technique is to eliminate barriers to thread-local objects through escape analysis. Such analysis is typically quite effective in identifying thread-local accesses that are close to the object allocation site. It can eliminate both read and write barriers but is often more effective on write barriers. For example, we observe that an intraprocedural escape analysis can eliminate 40 to 50 percent of write barriers in vacation, genome, and b+tree. Its impact on performance is more limited, however: from negligible to 12 percent. To target redundant read barriers, a whole-program analysis called Not-Accessed-In-Transaction 39 eliminates some barriers to read-only objects in transactions.
• Barrier strength reduction. These optimizations do not eliminate barriers but identify at runtime special locations that require only lightweight barrier processing, such as dynamic tracking of thread-local objects 40, 41 and runtime filtering of stack references and duplicate references. 42
• Code generation optimizations. One common technique is to inline the fast path of barriers. It has the potential benefit of reducing function-call overhead, increasing ILP, and exposing reuse of common sub-barrier operations. In our experiments, compiler inlining achieved less than 2 percent overall improvement across our benchmark suite. 
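As an illustration of the runtime filtering of duplicate references mentioned above, the sketch below keeps a per-transaction filter of locations that have already gone through the full read barrier, so that repeated reads of the same location skip the logging step. It is a minimal sketch under stated assumptions rather than the mechanism used by any of the cited systems. The class `FilteredReadLog` and its counters are hypothetical, and a real barrier would still have to perform whatever consistency check its STM algorithm requires on the lightweight path.

```python
class FilteredReadLog:
    """Per-transaction read log that filters duplicate references."""

    def __init__(self):
        self.read_set = []        # locations to validate at commit
        self.seen = set()         # filter of locations already logged
        self.full_barriers = 0    # reads that took the full barrier
        self.filtered = 0         # reads that took the lightweight path

    def read_barrier(self, address):
        if address in self.seen:  # duplicate reference: skip the logging
            self.filtered += 1
            return
        self.seen.add(address)
        self.read_set.append(address)
        self.full_barriers += 1


log = FilteredReadLog()
for address in [0x10, 0x18, 0x10, 0x10, 0x20, 0x18]:
    log.read_barrier(address)
```

In this run, three of the six reads take the lightweight path and the read set holds three entries instead of six, which also shortens any later validation pass over it.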
RELATED WORK
The first STM system was proposed by Shavit and Touitou 44 and is based on object ownership. The protocol is static, which is a significant shortcoming that has been overcome by subsequently proposed STM systems. 45 Conflict detection is simplified significantly by the static nature because conflicts can be ruled out already when ownership records are acquired (at transaction start).
DSTM 46 is the first dynamic STM system; the design follows a per-object runtime organization (locator object). Variables (objects) in the application heap refer to a locator object. Unlike in a design with ownership records (for example, Harris and Fraser 47), the locator does not store a version number but refers to the most recently committed version of the object. A particularity of the DSTM design is that objects must be explicitly opened (in read-only or read-write mode) before transactional access; also, DSTM allows for early release. The authors argue that both mechanisms facilitate the reduction of conflicts.
The design principles of the RSTM 48 (Rochester STM) system are similar to DSTM in that it associates transactional metadata with objects. Unlike DSTM, however, the system does not require the dynamic allocation of transactional data but colocates it with the nontransactional data. This scheme has two benefits: first, it facilitates spatial access locality and hence fosters execution performance and transaction throughput; second, the dynamic memory management of transactional data (usually done through a garbage collector) is not necessary, and hence this scheme is amenable to use in environments where memory management is explicit.
Recent work has explored algorithmic optimizations and/or alternative implementations of the basic STM algorithms described in this article.
CONCLUSION
Based on our results, we believe that the road ahead for STM is quite challenging. Lowering the overhead of STM to a point where it is generally appealing is a difficult task, and significantly better results have to be demonstrated. If we could stress a single direction for further research, it is the elimination of dynamically unnecessary read and write barriers, possibly the single most powerful lever toward further reduction of STM overheads. Given the difficulty of similar problems explored by the research community such as alias analysis, escape analysis, and so on, this may be an uphill battle. Because the argument for TM hinges upon its simplicity and productivity benefits, we are deeply skeptical of any proposed solutions to performance problems that require extra work by the programmer.
We observed that the TM programming model itself, whether implemented in hardware or software, introduces complexities that limit the expected productivity gains, thus reducing the current incentive for migration to transactional programming and the justification at present for anything more than a small amount of hardware support.
ACKNOWLEDGMENTS

