Abstract. Software Transactional Memory (STM) compilers commonly instrument memory accesses by transforming them into calls to STM library functions. Done naïvely, this instrumentation imposes a large overhead, slowing down the transaction execution. Many compiler optimizations have been proposed in an attempt to lower this overhead. In this paper we attempt to drive the STM overhead lower by discovering sources of sub-optimal instrumentation, and providing optimizations to eliminate them. The sources are: (1) redundant reads of memory locations which have been read before, (2) redundant writes to memory locations which will be subsequently written to, (3) redundant writeset lookups of memory locations which have not been written to, and (4) redundant writeset record-keeping for memory locations which will not be read. We describe how static analysis and code motion algorithms can detect these sources, and enable compile-time optimizations that significantly reduce the instrumentation overhead in many common cases. We implement the optimizations over a TL2 Java-based STM system, and demonstrate the effectiveness of the optimizations on various benchmarks, measuring up to 29-50% speedup in a single-threaded run, and up to 19% increased throughput in a 32-thread run.
Introduction
Software Transactional Memory (STM) [12, 20] is an emerging approach that provides developers of concurrent software with a powerful tool: the atomic block, which aims to ease multi-threaded programming and enable more parallelism.
Conceptually, statements contained in an atomic block appear to execute as a single atomic unit: either all of them take effect together, or none of them take effect at all. In this model, the burden of carefully synchronizing concurrent access to shared memory, traditionally done using locks, semaphores, and monitors, is relieved. Instead, the developer needs only to enclose statements which access shared memory by an atomic block, and the STM implementation guarantees the atomicity of each block.
In the past several years there has been a flurry of software transactional memory design and implementation work; however, with the notable exception of transactional C/C++ compilers [19], many of the STM initiatives have remained academic experiments. There are several reasons for this; major among them is the large performance overhead [6]. In order to make a piece of code transactional, it usually undergoes an instrumentation process which replaces memory accesses with calls to STM library functions. These functions handle the book-keeping of logging, committing, and rolling-back of values according to the STM protocol. Naïve instrumentation introduces redundant function calls, for example, for values that are provably transaction-local. In addition, an STM with homogeneous implementation of its functions, while general and correct, will necessarily be less efficient than a highly-heterogeneous implementation: the latter can offer specialized functions that handle some specific cases more efficiently. For example, a homogeneous STM may offer a general STMRead() method, while a heterogeneous STM may also offer a specialized STMReadThreadLocal() method that assumes that the value read is thread-local, and as a consequence, can optimize away validations of that value.
Previous work has presented many compiler and runtime optimizations that aim to reduce the overhead of STM instrumentation. In this work we add to that body of knowledge by identifying additional new sources of sub-optimal instrumentation, and proposing optimizations to eliminate them. The sources are:
Redundant reads of memory locations which have been read before. We use load elimination, a compiler technique that reduces the amount of memory reads by storing read values in local variables and using these variables instead of reading from memory. This allows us to reduce the number of costly STM library calls.
Redundant writes to memory locations which will be subsequently written to. We use scalar promotion, a compiler technique that avoids redundant stores to memory locations, by storing to a local variable. Similar to load elimination, this optimization allows us to reduce the number of costly STM library calls.
Redundant writeset lookups for memory locations which have not been written to. We discover memory accesses that read locations which have not been previously written to by the same transaction. Instrumentation for such reads can avoid writeset lookup.
Redundant writeset record-keeping for memory locations which will not be read. We discover memory accesses that write to locations which will not be subsequently read by the same transaction. Instrumentation for such writes can therefore be made cheaper, e.g., by avoiding insertion to a Bloom filter.
Not all STM designs can benefit equally well from all the optimizations listed. For example, STMs that employ in-place updates, rather than lazy updates, will see less benefit from the redundant memory access optimization. From here on, we restrict the discussion to the TL2 protocol, which benefits from all of the optimizations.
In addition to the new optimizations, we have implemented the following optimizations which have been used in other STMs: 1. Avoiding instrumentations of accesses to immutable and transaction-local memory; 2. Avoiding lock acquisitions and releases for thread-local memory; and 3. Avoiding readset population in read-only transactions.
To summarize, this paper makes the following contributions:
We implement a set of common STM-specific analyses and optimizations.
We present and implement a set of new analyses and optimizations to reduce overhead of STM instrumentation.
We measure and show that our suggested optimizations can achieve significant performance improvements, up to 29-50% speedup in some workloads.
We proceed as follows: Section 2 gives a background of the STM we optimize.
In Section 3 we describe the optimization opportunities that our analyses expose.
In Section 4 we measure the impact of the optimizations. Section 5 reviews related work. We conclude in Section 6.
Deuce is non-invasive: it does not modify the JVM or the Java language, and it does not require recompiling source code in order to instrument it.
It works by introducing a new @Atomic annotation. Java methods which are annotated with @Atomic are replaced with a retry-loop that attempts to perform and commit a transacted version of that method. All methods are duplicated; the transacted copy of every method is similar to the original, except that all field and array accesses are replaced with calls to the Context interface, and all method invocations are rewritten so that the transacted copy is invoked instead of the original.
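The retry-loop described above can be pictured roughly as follows. This is only a minimal sketch under stated assumptions, not the code Deuce generates: the TxContext interface, its begin/commit/rollback methods, the AbortException type, and the FlakyContext test double are all hypothetical names introduced for illustration.

```java
import java.util.function.Function;

/** Sketch of the retry loop that stands in for an @Atomic method (all names hypothetical). */
class RetryLoopSketch {

    /** Hypothetical stand-in for an STM context; not Deuce's actual Context interface. */
    interface TxContext {
        void begin();
        boolean commit();   // returns false when commit-time validation fails
        void rollback();
    }

    /** Hypothetical exception raised by a transacted copy when it detects a conflict. */
    static class AbortException extends RuntimeException {}

    /** Runs the transacted copy of a method until one attempt commits. */
    static <T> T runAtomically(TxContext ctx, Function<TxContext, T> transactedCopy) {
        while (true) {
            ctx.begin();
            try {
                T result = transactedCopy.apply(ctx);
                if (ctx.commit()) {
                    return result;      // all effects become visible together
                }
            } catch (AbortException e) {
                // conflict detected mid-transaction: fall through and retry
            }
            ctx.rollback();             // discard the effects of the failed attempt
        }
    }

    /** Toy context whose first `failures` commit attempts fail, to exercise the loop. */
    static class FlakyContext implements TxContext {
        int attempts = 0;
        int rollbacks = 0;
        private final int failures;
        FlakyContext(int failures) { this.failures = failures; }
        public void begin() { attempts++; }
        public boolean commit() { return attempts > failures; }
        public void rollback() { rollbacks++; }
    }

    public static void main(String[] args) {
        FlakyContext ctx = new FlakyContext(2);
        Integer result = runAtomically(ctx, c -> 42);
        System.out.println(result + " after " + ctx.attempts + " attempts");
    }
}
```

In this sketch the transacted copy receives the context explicitly, mirroring the idea that every field and array access inside it is routed through the STM rather than performed directly.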
Deuce works either in online or offline mode. In online mode, the entire process of instrumenting the program happens during runtime. A Java agent is attached to the running program, by specifying a parameter to the JVM. During runtime, just before a class is loaded into memory, the Deuce agent comes into play and transforms the program in-memory. To read and rewrite classes, Deuce uses ASM [4], a general-purpose bytecode manipulation framework.
In order to avoid the runtime overhead of the online mode, Deuce offers the offline mode, which performs the transformations directly on compiled .class files. In this mode, the program is transformed similarly, and the transacted version of the program is written into new .class files.
Deuce's STM library is homogeneous. In order to allow its methods to take advantage of specific cases where optimization is possible, we enhance each of its STM functions to accept an extra incoming parameter, advice. This parameter is a simple bit-set representing information that was pre-calculated and may help fine-tune the instrumentation. For example, when writing to a field that will not be read, the advice passed to the STM write function will have 1 in the bit corresponding to no-read-after-write.
In this work we focus on the Transactional Locking II (TL2) protocol implementation in Deuce. TL2 is word-based, lock-based and uses the lazy update strategy. Full details can be found in [7].
3 Optimization Opportunities
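One way to picture the advice bit-set is as an integer of flags. The sketch below is illustrative only: the constant names and bit positions are assumptions made for this example and are not taken from Deuce.

```java
/** Sketch of an advice bit-set; constant names and bit positions are assumptions. */
final class AdviceSketch {
    static final int NONE                 = 0;
    // Read advice: the location has not been written earlier in this transaction.
    static final int NO_WRITE_BEFORE_READ = 1 << 0;
    // Write advice: the location will not be read later in this transaction.
    static final int NO_READ_AFTER_WRITE  = 1 << 1;

    /** Returns true when the given flag bit is set in the advice word. */
    static boolean has(int advice, int flag) {
        return (advice & flag) != 0;
    }

    public static void main(String[] args) {
        // What instrumented code might pass for a write to a field that is never read back.
        int adviceForWrite = NO_READ_AFTER_WRITE;
        System.out.println(has(adviceForWrite, NO_READ_AFTER_WRITE));   // true
        System.out.println(has(adviceForWrite, NO_WRITE_BEFORE_READ));  // false
        System.out.println(has(NONE, NO_READ_AFTER_WRITE));             // false
    }
}
```

Because the advice is computed at compile time, each instrumented call site would pass a constant, so the check inside the STM function costs a single mask-and-test.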
The following are optimization opportunities that we have detected.
Preventing Redundant Memory Accesses
Load Elimination Consider the following code fragment that is part of an atomic block (derived from the Java version of the STAMP suite):
for (int j = 0; j < nfeatures; j++) { new_centers
We note that this optimization is sound for all STM protocols that guarantee isolation. The performance boost achieved by it, however, is maximized with lazy-update STMs as opposed to in-place-update STMs. The reason is that lazy-update STMs must instrument every memory access, while in in-place-update STMs it suffices to instrument just the first memory access. Repeated accesses to the same memory locations are transparent.
Using PRE of memory locations with a lazy-update STM evens the ground: the first memory access is instrumented, and its value is stored in a temporary 
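The effect of load elimination on instrumentation can be sketched with a small, self-contained example. It is not the K-Means fragment: the Config class, the stmReadScale method, and the call counter are hypothetical, and the data array is assumed to be transaction-local so that it needs no instrumentation.

```java
/** Sketch of load elimination inside a transaction; stmReadScale is a hypothetical STM call. */
class LoadEliminationSketch {
    static int stmReads = 0;                 // counts simulated STM library calls

    static class Config { int scale = 3; }   // stands in for a shared object

    /** Hypothetical instrumented read of cfg.scale; a real STM would validate and log here. */
    static int stmReadScale(Config cfg) {
        stmReads++;
        return cfg.scale;
    }

    /** Naive instrumentation: cfg.scale goes through the STM on every iteration. */
    static int naive(Config cfg, int[] data) {
        int sum = 0;
        for (int j = 0; j < data.length; j++) {
            sum += stmReadScale(cfg) * data[j];
        }
        return sum;
    }

    /** After load elimination: one instrumented read, then a local temporary is reused. */
    static int optimized(Config cfg, int[] data) {
        int sum = 0;
        int scale = stmReadScale(cfg);       // the first access is still instrumented
        for (int j = 0; j < data.length; j++) {
            sum += scale * data[j];          // later uses read the local, no STM call
        }
        return sum;
    }

    public static void main(String[] args) {
        Config cfg = new Config();
        int[] data = {1, 2, 3, 4};           // assumed transaction-local

        stmReads = 0;
        System.out.println(naive(cfg, data) + " with " + stmReads + " STM reads");
        stmReads = 0;
        System.out.println(optimized(cfg, data) + " with " + stmReads + " STM reads");
    }
}
```

Both versions compute the same value; reusing the temporary is justified only because the enclosing atomic block is isolated, so no other thread's update to the field can become visible between the iterations.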
Preventing Redundant Writeset Operations
Redundant Writeset Lookups Consider a field read statement v = o.f inside a transaction. The STM must produce and return the most up-to-date value of o.f. In STMs that implement lazy update, there can be two ways to look up o.f's value: if the same transaction has already written to o.f, then the most up-to-date value must be found in the transaction's writeset. Otherwise, the most up-to-date value is the one in o.f's memory location. A naïve instrumentation will conservatively always check for containment in the writeset on every field read statement. With static analysis, we can gather information whether the accessed o.f was possibly already written to in the current transaction. If we can statically deduce that this is not the case, then the STM may skip checking the writeset, thereby saving processing time.
Redundant Writeset Record-Keeping Consider a field write statement o.f = v inside a transaction. According to the TL2 protocol, the STM must update the writeset with the information that o.f has been written to. One of the design goals of the writeset is that it should be fast to search it; this is because subsequent reads from o.f in the same transaction must use the value that is in the writeset. But, some memory locations in the writeset will never be actually read in the same transaction. We can exploit this fact to reduce the amount of record-keeping that the writeset data-structure must handle. As an example, TL2 suggests implementing the writeset as a linked-list (which can be efficiently added-to and traversed) together with a Bloom filter (that can efficiently check whether a memory location exists in the writeset). If we can statically deduce that a memory location is written-to but will not be subsequently read in the same transaction, we can skip updating the Bloom filter for that memory location. This saves processing time, and is sound because there is no other purpose in updating the Bloom filter except to help in rapid lookups.
4 Experimental Results
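The read path of a lazy-update STM with an optional writeset lookup might be sketched as follows. This is a toy model, not TL2 or Deuce code: locations are plain strings, validation and readset logging are omitted, and the stmRead/stmWrite names and the advice bit are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy lazy-update read path; names, types, and the advice bit are illustrative assumptions. */
class ReadPathSketch {
    static final int NO_WRITE_BEFORE_READ = 1;   // hypothetical advice bit

    final Map<String, Integer> memory = new HashMap<>();    // stands in for shared memory
    final Map<String, Integer> writeset = new HashMap<>();  // updates buffered until commit
    int writesetLookups = 0;                                 // counts lookups actually performed

    /** Lazy update: the new value is buffered, shared memory is untouched. */
    void stmWrite(String location, int value) {
        writeset.put(location, value);
    }

    /** Reads a location, skipping the writeset when the advice says no earlier write exists. */
    int stmRead(String location, int advice) {
        if ((advice & NO_WRITE_BEFORE_READ) == 0) {
            writesetLookups++;
            Integer buffered = writeset.get(location);
            if (buffered != null) {
                return buffered;                  // the transaction's own pending write wins
            }
        }
        return memory.getOrDefault(location, 0);  // validation and readset logging omitted
    }

    public static void main(String[] args) {
        ReadPathSketch tx = new ReadPathSketch();
        tx.memory.put("o.f", 7);

        // Analysis proved no earlier write to o.f: the lookup is skipped.
        System.out.println(tx.stmRead("o.f", NO_WRITE_BEFORE_READ) + ", lookups=" + tx.writesetLookups);

        // After a write, a later read must carry no such advice and consult the writeset.
        tx.stmWrite("o.f", 9);
        System.out.println(tx.stmRead("o.f", 0) + ", lookups=" + tx.writesetLookups);
    }
}
```

The sketch also shows why the advice must be conservative: passing the bit for a read that follows a write to the same location would return the stale value from memory.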
In order to test the benefit of the above optimization opportunities, we used Deuce [16], a Java-based STM framework.
PRE optimizations (section 3.1) require no change to the actual Deuce runtime; they only require an extra preliminary optimization pass.
The optimization of preventing redundant writeset operations (section 3.2) needs to actually change the instrumentation. To do it, we enhance each of Deuce's STM library methods to accept an extra bit-set parameter, advice, every bit of which denotes an optimization opportunity. Our compile-time analyses discover the opportunities and supply the advice parameters to the STM library method calls. The STM library methods detect the enabled bits in the advice parameters and apply the relevant optimizations. Specifically, the STM read method, upon seeing a 1 in the bit corresponding to no-write-before-read, will avoid looking up the memory location in the writeset. Similarly, the STM write function, upon seeing a 1 in the bit corresponding to no-read-after-write, will avoid updating the Bloom filter.
Our test environment is a Sun UltraSPARC T2 Plus multicore machine with 2 CPUs, each with 8 cores at 1.2 GHz, each core with 8 hardware threads for a total of 128 threads.
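The write-side behaviour can be sketched with a toy writeset made of a linked list plus a one-hash, 64-bit Bloom filter. Everything here is an assumption made for illustration (the class and method names, the string-keyed locations, the filter size, and the advice bit position); it is not the Deuce or TL2 implementation.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

/** Toy writeset: a linked-list log plus a tiny Bloom filter (all names are assumptions). */
class WritePathSketch {
    static final int NO_READ_AFTER_WRITE = 1 << 1;   // hypothetical advice bit

    static final class Entry {
        final String location;
        final int value;
        Entry(String location, int value) { this.location = location; this.value = value; }
    }

    final List<Entry> log = new LinkedList<>();  // cheap to append to and to traverse at commit
    long bloom = 0L;                              // 64-bit filter with a single hash function
    int bloomUpdates = 0;                         // counts filter insertions actually performed

    static long bit(String location) {
        return 1L << (location.hashCode() & 63);
    }

    /** Always logs the write; updates the filter only if a later read may need to find it. */
    void stmWrite(String location, int value, int advice) {
        log.add(new Entry(location, value));
        if ((advice & NO_READ_AFTER_WRITE) == 0) {
            bloom |= bit(location);
            bloomUpdates++;
        }
    }

    /** Fast membership pre-check used by reads; false means "definitely not recorded". */
    boolean mightContain(String location) {
        return (bloom & bit(location)) != 0;
    }

    /** Commit walks the log, never the filter, so skipped filter updates lose no writes. */
    void commit(Map<String, Integer> memory) {
        for (Entry e : log) {
            memory.put(e.location, e.value);
        }
    }

    public static void main(String[] args) {
        WritePathSketch tx = new WritePathSketch();
        tx.stmWrite("a.f", 1, 0);                    // may be read later: filter updated
        tx.stmWrite("b.g", 2, NO_READ_AFTER_WRITE);  // never read again: filter skipped

        Map<String, Integer> memory = new HashMap<>();
        tx.commit(memory);
        System.out.println("filter updates=" + tx.bloomUpdates + ", committed=" + memory.size());
    }
}
```

The commit method illustrates the soundness argument: the filter only accelerates lookups, so a write that is absent from it is still applied at commit time.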
Optimization Levels
We compared 5 levels of optimizations. The levels are cumulative in that every level includes all the optimizations of the previous levels. The None level is the most basic code, which blindly instruments every memory access. The Common level adds several well-known optimizations that are common in STMs. These include 1. Avoiding instrumentations of accesses to immutable and transaction-local memory; 2. Avoiding lock acquisitions and releases for thread-local memory; and 3. Avoiding readset population for read-only transactions. The PRE level consists of load elimination and scalar promotion optimizations. The ReadOnly level avoids redundant writeset lookups for memory locations which have not been written to. Finally, the WriteOnly level avoids redundant writeset record-keeping for memory locations which will not be read.
Benchmarks
We experimented on a set of data structure-based microbenchmarks and several benchmarks from the Java version of the STAMP [5] suite.
MatrixMul is part of the Java version of the STAMP suite. It performs matrix multiplication.
In the STAMP benchmarks we measured the time it took for each test to complete. 
Optimization Opportunities Breakdown
To understand to what extent optimizations are applicable to the benchmarks, we compared optimization-specific measures on single-threaded runs. The results appear in Tables 1 and 2. The measure for PRE is the percent of reads eliminated by load elimination and scalar promotion (compared to the Common level). The measure for ReadOnly is the percent of read statements that access memory locations which have not been written to before in the same transaction. The measure for WriteOnly is the percent of write statements that write to memory which will not be read in the same transaction. All numbers are measured dynamically at runtime. High percentages represent more optimization opportunities. Low percentages mean that we could not locate many optimization opportunities, either because they do not exist, or because our analyses were not strong enough to find them.
Analysis
Our benchmarks show that the optimizations have improved performance to varying degrees. The most noticeable performance gain was due to PRE, especially in tight loops where many memory accesses were eliminated.
PRE K-Means benefits greatly (up to 29% speedup) from load elimination: the above example (section 3.1) is taken directly from K-Means. MatrixMul also benefits from PRE due to the elimination of redundant reads of the main matrix object. Vacation achieves 4% speedup in the single-threaded run, but sees little to no speedup as the number of threads rises; this is because the eliminated loads exist outside of tight loops.
Our Scalar Promotion analysis, which focuses on finding loops where the same memory location is re-written in every iteration, was not able to find this pattern in any of the tested benchmarks. A more thorough analysis, that also considers writes outside of loops, may have been able to detect some opportunities for enabling the Scalar Promotion optimization.
ReadOnly LinkedList benefits (throughput increased by up to 28%) from the ReadOnly optimization, which applies to reading the next node in the list. This optimization is valid since traversal is done prior to updating the next node.
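The loop pattern that this analysis looks for, and the effect of promoting it, can be sketched as follows. The example is illustrative and not drawn from any of the benchmarks: the Counter class, the stmWriteTotal method, and the call counter are hypothetical, and the toy write goes straight to the field where a real lazy-update STM would buffer it in the writeset.

```java
/** Sketch of scalar promotion inside a transaction; stmWriteTotal is a hypothetical STM call. */
class ScalarPromotionSketch {
    static int stmWrites = 0;               // counts simulated STM library calls

    static class Counter { int total; }     // stands in for a shared object

    /** Hypothetical instrumented write of c.total (a real lazy STM would buffer the value). */
    static void stmWriteTotal(Counter c, int value) {
        stmWrites++;
        c.total = value;
    }

    /** Naive instrumentation: the same location is rewritten through the STM every iteration. */
    static void naive(Counter c, int[] data) {
        int running = 0;
        for (int x : data) {
            running += x;
            stmWriteTotal(c, running);
        }
    }

    /** After scalar promotion: the loop updates a local, and one STM write follows the loop. */
    static void promoted(Counter c, int[] data) {
        int running = 0;
        for (int x : data) {
            running += x;                   // no STM call inside the loop
        }
        if (data.length > 0) {              // keep the zero-iteration case write-free, as in naive
            stmWriteTotal(c, running);
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 6, 7};

        Counter a = new Counter();
        stmWrites = 0;
        naive(a, data);
        System.out.println("naive: total=" + a.total + ", STM writes=" + stmWrites);

        Counter b = new Counter();
        stmWrites = 0;
        promoted(b, data);
        System.out.println("promoted: total=" + b.total + ", STM writes=" + stmWrites);
    }
}
```

Only the final value is observable outside the atomic block, so the intermediate stores can be dropped without changing what other transactions see.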
We note that the read of the head of the list is also a read-only memory access; however this is subsumed by the Common optimizations because the head is immutable. Hash's throughput is increased by up to 4% due to ReadOnly opportunities in the findIndex() method, which is called on every transaction.
SSCA2 and MatrixMul see modest benefits.
Our analysis discovered that in 4 benchmarks: LinkedList, Hash, SSCA2 and MatrixMul, all reads are from memory locations which have not been written to before in the same transaction. We suspect that reading before writing is the norm in almost all transactions, but our analyses could prove it only in these 4.
WriteOnly
compiler optimization techniques (see [14] for a full treatment), such as loop peeling, method inlining, and redundancy elimination algorithms are applied to atomic blocks. Eddon and Herlihy [9] apply fully interprocedural analyses to discover thread-locality and subsequent accesses to the same objects. Such discoveries are exploited for fast-path handling of these cases. Similar optimizations also appear in Wang et al. [24] and Dragojevic et al. [8].
We note that the above works optimize for in-place-update STMs. In such an STM protocol, once an object is open-for-write, memory accesses to its fields are transparent (free), because the object is exclusively owned by the transaction.
Our work is different because it targets lazy-update STMs, where this form of optimization is invalid. A lazy-update STM keeps a writeset where it gathers all the memory location writes that occur within the transaction. It does not update the memory in-place; this happens only at commit time. Therefore, we still need to instrument even subsequent memory accesses, because they cannot transparently access the memory locations themselves. We solve this problem by working with local copies of memory locations, which require no instrumentation.
Spear et al. [22] proposes
Beckman et al.'s work [3] provides optimizations for thread-local, transaction-local and immutable objects that are guided by access permissions. These are Java attributes that the programmer must use to annotate program references. For example, the @Imm attribute denotes that the associated reference variable is immutable. Access permissions are verified statically by the compiler, and then used to optimize the STM instrumentation for the affected variables.
Partial redundancy elimination (PRE) techniques [15, 18] are widely used in the field of compiler optimizations; however, most of the focus has been on removing redundancies of arithmetic expressions. Fink et al. [10] and Hosking et al. [13] were the first to apply PRE to Java access path expressions, for example, expressions like a.b[i].c. This variant of PRE is also called load elimination. As a general compiler optimization, load elimination may be unsound because it may miss concurrent updates by a different thread that changes the loaded value. Therefore, some works [2, 23] propose analyses that detect when load elimination is valid. Scalar promotion, which eliminates redundant memory writes, was introduced by Lu and Cooper [17], and improved by later works (e.g. [21]).
Conclusions and Further Work
We showed that two pre-existing optimizations, load elimination and scalar promotion, can be used in an optimizing STM compiler. Where standard compilers need to perform an expensive cross-thread analysis to enable these optimizations, an STM compiler can rely on the atomic block's isolation property to enable them. We also highlighted two redundancies in STM read and write operations, and showed how they can be optimized.
We implemented a compiler pass that performs these STM-specific code motion optimizations, and another pass that uses static analysis methods to discover optimization opportunities for redundant STM read and write operations. We have augmented the interface of the underlying STM compiler, Deuce, to accept information about which optimizations to enable at every STM library method call, and modified the STM methods themselves to apply the optimizations when possible.
The combined performance benet of all the optimizations presented here varies with the workload and the number of threads. While some benchmarks see little to no improvement (e.g., SSCA2 and SkipList), we have observed speedups of up to 50% and 29% in other benchmarks (single-threaded MatrixMul and K-Means, respectively).
There are many ways to improve upon this research. For example, a drawback of the optimizations presented here is that they require full interprocedural analysis to make sound decisions. It may be interesting to research which similar optimizations can be enabled with less analysis work, for example, with running only intraprocedural analyses, or with partial analysis data that is calculated at runtime.
