ABSTRACT As many-core processors become more prevalent, the degree of parallelism in applications is rapidly increasing. Multi-threading is a well-known and effective way to improve performance by exploiting multiple cores. However, synchronization between threads can limit the concurrency and scalability of applications. Hardware transactional memory (HTM) has been studied as a way to simplify synchronization, and Intel adopted transactional synchronization extensions (TSX) for its processors in 2012. TSX can dynamically decide and perform instructions as an atomic transaction. In this paper, we evaluate and analyze the performance of TSX, since this latest HTM implementation is expected to be a promising solution for handling a high degree of parallelism. We found two major causes of performance degradation and, based on our analysis, propose a novel approach to address them more effectively. We also introduce a mechanism, named ParaTM, to transparently adopt TSX for existing lock-based applications. By using ParaTM, one can apply TSX features without modifying the code. From our evaluation using a micro-benchmark and real-world applications, we confirmed that ParaTM is highly effective in terms of transparency and performance. ParaTM achieved 1.75x, 4.76x, and 1.53x better performance than the traditional lock mechanism for LevelDB, RocksDB, and Memcached, respectively.
I. INTRODUCTION
As multi-/many-core processors become more prevalent, the degree of parallelism in applications is rapidly increasing [5]. Especially in areas of high performance computing (HPC) such as deep learning, big data processing, and bioinformatics, having hundreds of threads and processes is a common configuration. With these computation-intensive workloads, it is expected that many-core systems and massively parallel applications will become more popular. Unfortunately, this computing trend will intensify the problem of synchronization between processes and threads, and the development of mechanisms and algorithms to address it is having trouble keeping up with this explosive increase, especially in terms of scalability. Applications such as key-value stores and parallel service servers are representative examples that already experience synchronization problems. For example, LevelDB [1], RocksDB [3], and Memcached [2] use many internal lock mechanisms, and it is known that this significantly affects the overall system performance.
In environments where a high degree of parallelism is required, the overall performance of applications can be significantly degraded by a very small sequential part. According to Amdahl's law, if the sequential part increases from 1% to 2%, the achievable speedup may drop from about 50 times to about 33 times [17]. Nevertheless, when a large number of parallel processes or threads access shared resources, a synchronization mechanism is inevitably used to protect the critical section. It is also well known that the performance degradation can vary largely depending on which synchronization mechanism is used. Thus, many synchronization mechanisms have been proposed.
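As a rough check of these figures, Amdahl's law gives the speedup for a sequential fraction s on n processors as 1 / (s + (1 - s) / n). The sketch below assumes n = 100, a value that is not stated in the text but that reproduces speedups of roughly 50 and 33; the function name is illustrative.

```c
/* Amdahl's law: speedup for a sequential fraction s (0..1) on n processors. */
double amdahl_speedup(double s, double n)
{
    return 1.0 / (s + (1.0 - s) / n);
}
```

Under that assumption, amdahl_speedup(0.01, 100) is about 50.3 and amdahl_speedup(0.02, 100) is about 33.6, so doubling a 1% sequential part costs roughly a third of the achievable speedup.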
As naive approaches, lock-based synchronization algorithms to manage the accesses to shared resources were introduced, but most of them have suffered from deadlock, starvation, and low scalability issues. Then, lock-free algorithms and data structures were suggested to avoid and mitigate the deadlock and blocking problems [12], [13], [23], [24]. Although they show high concurrency without lock-based mutual exclusion, lock-free algorithms increase the application complexity. Moreover, they still require more scalability to meet future many-core parallelism.
On the other hand, transactional memory (TM) [14] is a lock-free approach to simplify synchronization problems. TM guarantees the atomic execution of instructions. Similar to the transactions of database systems, TM commits the execution of instructions to memory if the instructions are executed without conflicts. In the case of conflicts, the execution is rolled back and tried again later. It is known that TM can be effective when implemented with hardware support, because the cache coherence protocol among CPU cores can detect any conflicts easily. TM has the potential to solve various problems of existing synchronization schemes, and therefore several TMs with hardware support have been proposed [6], [9], [16], [27].
In 2012, Intel adopted a new feature named transactional synchronization extensions (TSX) [7], a hardware-based TM which dynamically decides and performs the instructions for serialization. Although Intel disabled this feature for the initial versions of TSX-capable CPUs due to a hardware bug [4], [15], most current Intel processors now support the TSX functionality. However, there are some problems and limitations regarding the use of TSX. First, when an abort occurs during a transaction, the resulting overhead degrades performance. Therefore, several studies and optimization methods have been proposed to overcome this problem [11], [30]. Second, due to hardware limitations, TSX does not guarantee that a transaction will ever commit. Therefore, an alternative synchronization mechanism must be implemented to account for situations where TSX transactions cannot be committed. Finally, applying TSX to existing applications incurs a large re-engineering cost. The developer must check the application code, modify the code, and rebuild it.
In this paper, we evaluate and analyze the performance of TSX because it is the latest technology implementing hardware-based TM to cope with synchronization scalability and is expected to be a promising solution for handling a high degree of parallelism. Based on our analysis, we also propose a mechanism to utilize TSX more effectively, named ParaTM. By using ParaTM, one can transparently apply TSX to traditional applications without modification.
The contributions of this paper can be summarized as follows.
• We analyzed the performance when using TSX. We used a micro-benchmark to compare the spinlock and the mutex of pthread with TSX. Based on these, we provide guidelines and development choices on whether using TSX is appropriate.
• We proposed ParaTM which can apply TSX without modifying or rebuilding traditional applications. Since modifying traditional applications is expensive or impossible in some cases, such as legacy applications, ParaTM can offer the advantage of easy adoption of TSX. In addition, a developer can easily test an application to decide whether to adopt TSX or not, as applying TSX to an application does not always enhance its performance due to the limitations of TSX.
• ParaTM is optimized to minimize TSX aborts. Beyond reducing the aborts caused by unrelated data dirtying a shared L1 cache line, further performance improvement requires reviewing all code that uses locks. ParaTM provides an additional interface that lets a developer review the code and select whether to apply TSX for each lock. For example, if an abort is certain, if the critical section is large, or if the probability of an abort is high, not using TSX is the better choice for performance. This can be expressed by changing only one line of code before locking a critical section.
• We evaluated the performance of ParaTM using real-world applications. We applied ParaTM to LevelDB, RocksDB, and Memcached as is, and achieved 1.75x, 4.76x, and 1.53x better performance, respectively.
Section II introduces TSX and discusses the problems and limitations of TSX. In Section III, we show how to use TSX more effectively based on a case study and describe ParaTM in detail for the optimization and easy adoption of TSX. Section IV demonstrates the performance impacts of ParaTM on a micro-benchmark and real-world applications, and Section V discusses the differences with related works and the limitations of ParaTM. Finally, Section VI concludes this paper and suggests plans for future work.
II. A CASE STUDY
A. INTEL TRANSACTIONAL SYNCHRONIZATION EXTENSIONS
With TSX, the code area specified by a developer is executed in an atomic transaction. If the transactional execution is completed successfully, all memory operations are atomically committed. However, if the transactional execution cannot be completed successfully, a transaction abort occurs. In that case, all updates are cancelled and the execution should be retried later.
Although a transactional abort can occur for many reasons, one of the primary reasons is conflicting accesses between different processes or threads during transactional execution. TSX tracks the memory read within the transactional region as the read-set and the memory written as the write-set. The read-set and write-set are managed at the granularity of L1 cache lines. A conflict occurs when another process reads data in the write-set, or writes data in the read-set or write-set, of a transactional region. When a transactional abort occurs, the processor state is restored, and execution resumes at the fallback code, which may retry the transaction.
45418 VOLUME 6, 2018
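The conflict rule above can be written as a small predicate. This is a simplified model for illustration only, assuming 64-byte cache lines; the type and function names are hypothetical and it does not describe how the hardware is implemented.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u /* assumed line size */

typedef enum { ACCESS_READ, ACCESS_WRITE } access_kind;

/* Index of the cache line that contains address p. */
static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}

/*
 * Returns true if an access by another thread conflicts with an access
 * already tracked by a transaction. Both must fall in the same cache line,
 * and either the transaction wrote the line (write-set) or the other
 * thread is writing it.
 */
bool tx_conflicts(const void *tx_addr, access_kind tx_kind,
                  const void *other_addr, access_kind other_kind)
{
    if (line_of(tx_addr) != line_of(other_addr))
        return false;
    return tx_kind == ACCESS_WRITE || other_kind == ACCESS_WRITE;
}
```

Note that two different variables conflict under this rule whenever they share a line, which is why data layout matters for the abort rate.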
TSX provides two software interfaces: hardware lock elision (HLE) and restricted transactional memory (RTM). HLE is a legacy-compatible instruction set extension that runs critical sections transactionally by using the XACQUIRE and XRELEASE instruction prefixes. RTM is more useful for implementing transactional execution since it provides more flexible software interfaces and programmable fallback paths. RTM supports three new instructions to start, commit, and abort transactional execution: XBEGIN, XEND, and XABORT.
Unfortunately, TSX does not guarantee that a transactional execution will ever be committed. For example, if the transactional resources are exceeded, or if a system call is attempted in a transactional region, the transaction will always be aborted. Thus, developers must provide an alternative method (e.g., a traditional lock mechanism) in the fallback path. Figure 1 provides example code that applies RTM to obtain and release a lock. RTM can operate similarly to a lock-free mechanism, executing a transaction without locks. If a transaction is aborted for any reason, it is retried up to a user-defined number of times (lines 3-8). When all attempts fail, the execution acquires a lock by using a traditional lock mechanism (line 9). To ensure correctness, the lock should be checked during the transactional execution, and the transaction should be aborted if the lock is not free. Figure 2 shows example code that applies TSX to a critical section. It is similar to code using a traditional lock mechanism.
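The retry-then-fallback pattern described above can be sketched with the RTM intrinsics from <immintrin.h>. This is a minimal illustration and not the authors' Figure 1: the type elided_lock, the function names, and the retry budget are assumptions, and the CPUID check is added here so the code degrades to a plain spinlock on processors without RTM.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>
#define BUILD_HAS_RTM 1
#define RTM_TARGET __attribute__((target("rtm"))) /* avoids needing -mrtm */
#else
#define BUILD_HAS_RTM 0
#define RTM_TARGET
#endif

#define MAX_RETRIES 3 /* hypothetical retry budget */

/* 0 = free, 1 = held by a thread on the fallback (spinlock) path. */
typedef struct { atomic_int held; } elided_lock;

/* CPUID leaf 7, sub-leaf 0, EBX bit 11 reports RTM support. */
static bool cpu_has_rtm(void)
{
#if BUILD_HAS_RTM
    unsigned eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return (ebx >> 11) & 1u;
#endif
    return false;
}

RTM_TARGET void elided_lock_acquire(elided_lock *l)
{
#if BUILD_HAS_RTM
    if (cpu_has_rtm()) {
        for (int i = 0; i < MAX_RETRIES; i++) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                /* Reading the lock word adds it to the read-set, so a
                 * thread taking the fallback lock aborts this transaction. */
                if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0)
                    return;      /* critical section runs transactionally */
                _xabort(0xff);   /* lock is busy: abort explicitly */
            }
            if (status & _XABORT_CAPACITY)
                break;           /* resources exceeded: retrying is futile */
        }
    }
#endif
    /* Fallback path: a simple spinlock on the same lock word. */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
        expected = 0;
}

RTM_TARGET void elided_lock_release(elided_lock *l)
{
#if BUILD_HAS_RTM
    /* A free lock word means this thread is inside a transaction. */
    if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0) {
        _xend();
        return;
    }
#endif
    atomic_store(&l->held, 0);
}
```

A caller brackets the critical section with elided_lock_acquire() and elided_lock_release(), exactly as with an ordinary lock. The critical section must not make system calls, since those abort the transaction every time.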
B. MICRO-BENCHMARK WITH TSX
For the performance evaluation and analysis of TSX, we ran the sysbench thread benchmark [26]. TSX cannot be applied directly to the off-the-shelf sysbench thread benchmark because each thread calls the yield system call, and calling a system call within a transaction causes the transaction to abort. Therefore, we modified the sysbench thread source code. In the modified sysbench, all threads are divided into two groups. One group sequentially traverses a shared array of 4-byte integers and increments each value, and the other group sequentially reads the shared integer array. Each thread performs this task 100,000 times.
FIGURE 3. Execution time of the sysbench thread benchmark with TSX, the pthread mutex and the pthread spinlock. TSX shows lower performance than the pthread mutex. In contrast to the others, the pthread spinlock suffers from poor scalability when the number of threads is larger than the number of physical cores.
For this evaluation, we used an Intel Core i7-7700 with 4 cores and without hyperthreading, and Figure 3 shows the evaluation results. The pthread spinlock shows the best performance from one to four threads, but its performance drastically decreases as the number of threads increases further. The pthread spinlock suffers from spinning to get a lock when the number of threads is larger than the number of physical cores. On the other hand, the performance of the pthread mutex is slightly better than that of TSX. We concluded that the abort operations caused by data conflicts are the major reason for the low performance, as shown in Figure 4. Although it shows no regular pattern, the number of aborts is more than 10M from 4 threads. These abort operations of TSX incur substantial overhead, and their retries exacerbate it further.
III. TRANSPARENT EMBEDDING OF TSX
In this section, we present ParaTM in detail. First we describe its design choices and suggest optimization techniques. Then, we present how to apply ParaTM to traditional applications, and exceptional considerations and the implementation of ParaTM are addressed.
A. ANALYSIS
From Section II, we found that TSX is not a silver bullet, i.e., simply adopting TSX does not always improve the performance of parallel applications. Moreover, adopting TSX in a traditional application inevitably requires modifying and re-engineering the code. In order to overcome these limitations, we analyze them in detail below.
The main reason that TSX does not improve performance is the overhead of aborted transactions. A TSX abort can be caused by a data collision, meaning that two or more threads/processes access the same data and one of them must be cancelled. Such aborts are natural and reasonable, since they are what maintains synchronization. Aside from data collisions, another reason for aborts is crossing the protection boundary via a system call in a critical section. This is because when a system call is made, the cache contents are emptied and the cached TSX data cannot be used. Thus, a developer should not use TSX in critical sections where system calls are frequent. In other cases, an L1 cache line becomes dirty due to accesses to unrelated data, and the transaction is aborted. This is because transaction data is managed in units of cache lines. To avoid these conflicts, the data must be aligned to the cache line size. Finally, an abort occurs when the data in the transaction exceeds the allowable size of the cache. In this case, it might be better not to use TSX.
On the other hand, applying TSX to traditional applications requires a great deal of re-engineering. This is because the developers need to understand the entire application structure and modify each line of code. Developers should also consider whether the TSX should be adopted or not for better performance. For example, if the abort of TSX always happens or frequently occurs within a critical section (e.g., system call, large number of operations, large memory usage), it might be better to use other synchronization mechanisms instead of TSX.
B. DESIGN
We propose ParaTM, which is an easy mechanism to apply TSX to traditional applications while addressing the problems mentioned in the previous sections. One of the major obstacles to adopting TSX is the re-engineering cost, because the program code must be modified and rebuilt to apply TSX to existing applications. Re-engineering such applications is expensive and even impossible in some cases. ParaTM provides an approach that avoids this overhead, and its optimization methods can enhance the performance of massively parallel applications.
1) EMBEDDING TO TRADITIONAL APPLICATIONS
Most parallel applications use the pthread library and its mutex lock mechanism since these are a de facto standard, due to their compatibility with most systems and their well-known high efficiency. Thus, ParaTM focuses on the pthread mutex mechanism and is designed to require no modification of existing applications that use the pthread library. ParaTM executes TSX instead of the pthread mutex mechanisms. Using the LD_PRELOAD environment variable, the pthread library code can be replaced with the TSX-enabled code in ParaTM, and TSX can be applied transparently. The mechanism for selecting between TSX and the pthread mutex provides further performance enhancement, and the detailed scheme is described in the optimization section. When an application should synchronize data using TSX, ParaTM executes the code that enables TSX (path 2). If not, the function call is forwarded to the pthread library, and then the normal pthread_mutex_lock() of the pthread library code is executed (path 3).
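The interposition step can be illustrated with a generic LD_PRELOAD hook. This is a sketch of the general technique under stated assumptions, not ParaTM's source: it only counts calls and forwards them to the next pthread_mutex_lock() found by dlsym(RTLD_NEXT, ...), and the counter and helper names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes RTLD_NEXT; must precede the first system header */
#endif
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#ifndef RTLD_NEXT
#define RTLD_NEXT ((void *)-1L) /* glibc's value, used only if _GNU_SOURCE came too late */
#endif

typedef int (*mutex_fn)(pthread_mutex_t *);

/* Illustrative counter of intercepted lock calls. */
static atomic_ulong intercepted_locks;

/* Look up the next definition of pthread_mutex_lock (normally libc's). */
static mutex_fn next_pthread_mutex_lock(void)
{
    static _Atomic(mutex_fn) cached;
    mutex_fn fn = atomic_load(&cached);
    if (fn == NULL) {
        fn = (mutex_fn)dlsym(RTLD_NEXT, "pthread_mutex_lock");
        if (fn == NULL)
            abort(); /* no underlying implementation to forward to */
        atomic_store(&cached, fn);
    }
    return fn;
}

/* This definition shadows the library's when the object is preloaded. */
int pthread_mutex_lock(pthread_mutex_t *m)
{
    atomic_fetch_add(&intercepted_locks, 1);
    /* A ParaTM-like library would try a TSX transaction here and only
     * forward to the original function when TSX is not to be used. */
    return next_pthread_mutex_lock()(m);
}
```

Built as a shared object (for example with gcc -shared -fPIC, adding -ldl on older glibc) and named in LD_PRELOAD, such a hook is picked up by an unmodified binary. A complete replacement would also have to interpose pthread_mutex_unlock() and related functions.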
2) OPTIMIZATION
ParaTM transparently replaces the traditional lock mechanism of an application with TSX and is designed to optimize performance further. First, adopting a batching scheme for iterative transactions can heavily reduce the synchronization overhead. Although batching transactions is a commonly used technique, a larger performance improvement can be obtained with TSX than with the pthread mutex. We confirmed that the performance of TSX was greatly increased when the batch option was applied to Memcached.
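A minimal sketch of the batching idea, shown with a plain pthread mutex for simplicity: the lock (or transaction) is entered once per batch of updates rather than once per update. The function name and its return value are illustrative assumptions.

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Increment every element of a shared array, entering the critical
 * section once per `batch` elements. Returns the number of lock
 * acquisitions, which is the synchronization cost being reduced.
 */
size_t increment_batched(int *shared, size_t n, size_t batch,
                         pthread_mutex_t *lock)
{
    size_t acquisitions = 0;

    if (batch == 0)
        batch = 1; /* treat 0 as "no batching" */

    for (size_t start = 0; start < n; start += batch) {
        size_t end = (n - start < batch) ? n : start + batch;

        pthread_mutex_lock(lock);
        for (size_t i = start; i < end; i++)
            shared[i]++;
        pthread_mutex_unlock(lock);
        acquisitions++;
    }
    return acquisitions;
}
```

With 100 elements, a batch size of 8 needs 13 critical sections instead of 100. With TSX, fewer transactions also means fewer opportunities to abort, although each transaction touches more cache lines.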
The second approach is aligning data to be fit into the cache line size. This optimization contributes to reducing the number of transactional aborts. However, due to the low cache efficiency, the performance can be degraded when the data access pattern is sequential.
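The alignment technique and its cost can be sketched as follows, assuming 64-byte cache lines and an array that starts on a line boundary. The type and function names are illustrative.

```c
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size in bytes */

/* Densely packed: 16 counters share one 64-byte line. */
typedef struct { int value; } packed_counter;

/* Padded: each counter occupies a cache line of its own. */
typedef struct { _Alignas(CACHE_LINE) int value; } padded_counter;

/* Cache lines spanned by `count` elements of `elem_size` bytes. */
size_t lines_touched(size_t elem_size, size_t count)
{
    return (elem_size * count + CACHE_LINE - 1) / CACHE_LINE;
}
```

Sixteen packed counters fit in one line, so threads updating different counters still conflict. Sixteen padded counters occupy sixteen lines, which removes those false conflicts but makes a sequential scan touch sixteen times as many lines. This is the cache-efficiency trade-off noted above.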
The last technique is to provide an option to choose between TSX and the pthread mutex. As described, the system performance will decrease when a TSX abort is clearly expected, such as when system calls are made from a critical section. To address this, ParaTM provides the option of executing a critical section with either TSX or the pthread mutex. However, re-engineering of the application code is necessary in this case. Developers can use this option by adding a line of code before entering a critical section. Although this optimization requires modification of the applications, it may not be too much of a burden and has the potential for further performance gain.
3) CONDITION VARIABLE
When ParaTM replaces the pthread mutex with TSX, implementation of the condition variable should be carefully considered, since the condition variable mechanism calls the system call during its execution, which incurs the transaction abort of TSX. Thus, execution of the condition variable must be performed after the transaction is committed to avoid aborting.
ParaTM uses a separate buffer to hold the operations related to condition variables. All condition variable operations are stored in the buffer during the transaction, and the buffered commands are flushed once after the transaction is committed. Buffer entries located in the same cache line may collide with each other. Therefore, the size of each buffer entry is aligned to the cache line size to prevent collisions.
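The deferred-operation buffer can be sketched as below. This is an illustration under assumptions rather than ParaTM's implementation: it handles only signal operations, uses one cache-line-aligned slot per thread, and all names and capacities are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define CACHE_LINE   64 /* assumed line size in bytes */
#define MAX_THREADS  8  /* hypothetical number of per-thread slots */
#define MAX_DEFERRED 6  /* hypothetical capacity of one slot */

/* One slot per thread, aligned so that no two slots share a cache line. */
typedef struct {
    _Alignas(CACHE_LINE) size_t count;
    pthread_cond_t *pending[MAX_DEFERRED];
} deferred_slot;

static deferred_slot slots[MAX_THREADS];

/*
 * Called inside a transaction in place of pthread_cond_signal().
 * Only records the request, so no system call is made.
 * Returns 0 on success, -1 if the index is invalid or the slot is full.
 */
int defer_signal(size_t thread_idx, pthread_cond_t *cond)
{
    if (thread_idx >= MAX_THREADS)
        return -1;
    deferred_slot *s = &slots[thread_idx];
    if (s->count == MAX_DEFERRED)
        return -1;
    s->pending[s->count++] = cond;
    return 0;
}

/*
 * Called after the transaction has committed. Issues the recorded
 * signals and returns how many were flushed.
 */
size_t flush_deferred(size_t thread_idx)
{
    if (thread_idx >= MAX_THREADS)
        return 0;
    deferred_slot *s = &slots[thread_idx];
    size_t n = s->count;
    for (size_t i = 0; i < n; i++)
        pthread_cond_signal(s->pending[i]);
    s->count = 0;
    return n;
}
```

Because each slot is padded to a whole cache line, one thread appending to its slot does not touch a line in another thread's read-set or write-set.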
4) ALTERNATIVE SYNCHRONIZATION METHOD
TSX does not guarantee that the transaction will ever be committed, and therefore an alternative synchronization mechanism must be prepared as a fallback path. In ParaTM, a spinlock mechanism is used in consideration of the performance gain. The fallback path mechanism of ParaTM is implemented by an atomic operation using part of the mutex variable.
C. IMPLEMENTATION
Figure 6 shows the implementation of the pthread_mutex_lock() function in ParaTM. ParaTM redefines the pthread_mutex_lock() function of the pthread library, so the ParaTM version runs when the function is called from a user application. The first step in the function is to decide whether to use TSX (lines 3-4). If the use of TSX is disabled, the function calls the pthread_mutex_lock_org() function, which is dynamically linked to the traditional pthread_mutex_lock() function of the pthread library. When using TSX, ParaTM obtains a lock by executing TSX-specific functions (lines 6-18). To improve the performance, the transaction is not retried if the transaction resources are exceeded. Otherwise, the transaction is retried up to a predefined number of times. Finally, if all transaction retries fail, the alternative synchronization method is used (line 20). ParaTM commits the transaction and flushes all of the condition variable operations stored in the buffer, as mentioned in the previous subsection (line 10).
To disable TSX in ParaTM and avoid aborts that are certain to occur, one should insert a line of code before calling the pthread_mutex_lock function. This line of code sets a specific bit of the mutex variable, and ParaTM distinguishes between TSX and the pthread mutex by checking this bit. The bit is part of the mutex variable and is used for condition variables in the pthread library. Note that this minor trick causes no problems for condition variables, since ParaTM implements its own condition variable mechanism.
D. EMBEDDING TO LEVELDB, ROCKSDB, MEMCACHED
To measure how effective ParaTM is for existing applications, we applied it to two representative key-value stores, LevelDB and RocksDB, and to a highly parallel server, Memcached, which is widely used as a cache server. All the applications are configured to use ParaTM by setting the LD_PRELOAD environment variable, without any modification of the applications.
For further performance optimization, the applications can be modified to choose whether ParaTM uses TSX. LevelDB and RocksDB define and use their own mutex classes in the port library based on the pthread mutex. Therefore, a parameter is added to the lock member function of the mutex class as a user option for whether to use TSX. For critical sections that will not use TSX, the parameter value should be set. Memcached uses pthread_mutex directly, without wrapping it in a class.
IV. PERFORMANCE ANALYSIS
A. EVALUATION ENVIRONMENT
All our evaluations are performed on an Intel Core i7-7700 processor with 8GB RAM. The processor is configured to use 4 physical cores without hyperthreading. Each core has a 32KB L1 cache and a 256KB L2 cache, and the cores share an 8MB L3 cache. The cache line size is 64B. We use the GCC 5.4.0 compiler with the -mrtm option.
The synchronization methods used for the evaluations are TSX, the pthread mutex, and the pthread spinlock. In the implementation of TSX synchronization, a transaction can be retried up to 3 times. If all attempts fail, a traditional spinlock is used.
B. PERFORMANCE OF MICRO-BENCHMARK
To evaluate TSX performance, we ran the sysbench thread benchmark [26]. The detailed configuration is the same as described in Section II. As shown in Section II, the performance of TSX on the sysbench thread benchmark is lower than that of the pthread mutex, and the two major reasons are analyzed in detail below.
The first reason is cache conflicts. In TSX, the read-set and write-set are managed in units of cache lines, and therefore a transaction may cause conflicts even if different indexes are updated. In order to increase the efficiency of TSX, the addresses of data should be aligned to the cache line size. The other reason is overly fine-grained locking, and therefore we apply the batching scheme to transactional updates. Batching reduces the per-transaction overhead and the number of conflicts in TSX. Figure 9 shows the execution time of TSX with batch sizes of 8 and 64. Cache alignment is applied in all cases. The x-axis is the number of threads and the y-axis denotes the execution time in seconds. The batch size of 8 shows almost the same performance as the pthread mutex. With the batch size of 64, TSX performs better than the pthread mutex. With a batch size of 64 and 4 threads, TSX achieves 1.5 times better performance than the pthread mutex. Figure 10 shows the number of transactional aborts during the evaluation. As expected, the number of transactional aborts was significantly reduced when batching was applied. Figure 11 shows the execution time when increasing the batch size, with the number of threads fixed at 4. As the batch size increases, the performance enhancement also increases. Although the batching scheme enhances the performance of both the pthread mutex and TSX, the enhancement for TSX is much larger due to the reduced number of abort operations.
FIGURE 11. Execution time of the sysbench thread benchmark using TSX and mutex with increasing batch sizes. The number of threads is set to 4.
C. LEVELDB
LevelDB is a key-value storage library developed by Google and is one of the most commonly used key-value stores [1]. To evaluate the performance impact on LevelDB, we used the dbbench benchmark, which is included in the LevelDB software suite and widely used for the evaluation of LevelDB. The dbbench is set to perform fillrand and readrand operations 1,000,000 times using 16B keys and 100B values. Figure 12 and Figure 13 show the evaluation results, where the y-axis is the execution time normalized to the pthread mutex case. In the case of readrand with 2 threads, ParaTM shows the best performance among the three lock mechanisms, with an execution time that is more than 40% lower.
However, when run with 8 threads, the fillrand case shows worse performance than the pthread mutex. In the case of fillrand, having twice as many threads as cores incurs heavy collisions during write operations. As shown in Figure 14, the number of abort operations for 8 threads is more than 13M. Although the number of abort operations in fillrand is lower than that in readrand, its rate of increase is higher. The number of abort operations in readrand increases about 2 times from 4 threads to 8 threads (from 29M to 62M). In the case of fillrand, the number of abort operations with 4 threads is 4M, and that with 8 threads is 13M (more than 3 times). Moreover, considering that fillrand operations are write-heavy, the penalty for an abort is much larger than for read operations. Adopting TSX for heavy or long write operations can degrade the performance.
FIGURE 13. Execution time of LevelDB using the dbbench benchmark for the fillrand workload. The y-axis is normalized to the results of the pthread mutex.
FIGURE 14. Number of abort operations during readrand and fillrand of the dbbench benchmark on LevelDB. Although the number of abort operations in readrand is higher than that in fillrand, the rate of increase in fillrand is much higher when changing from 4 threads to 8 threads.
D. ROCKSDB
RocksDB is developed by Facebook, and it is known that RocksDB is used for the database systems of the Facebook web site [3]. Similar to LevelDB, we used the dbbench benchmark to evaluate the performance of ParaTM on RocksDB. We filled RocksDB using the fillrand operation before the readrand operation and set the dbbench to perform readrand operations 500,000 times using 8B keys and 256B values. Figure 15 shows the execution times, normalized to the execution time of the pthread mutex cases. For RocksDB, ParaTM shows the best performance when the number of threads is 4 and the ratio of fillrand is 0%, outperforming all the pthread mutex cases except when the number of threads is 1. Considering that most key-value servers run multiple threads to exploit parallelism, the evaluation results confirm that ParaTM is highly effective. Similar to the LevelDB evaluation, the performance enhancement decreases as the fill ratio increases.
E. MEMCACHED
Memcached is general-purpose memory caching server software and is widely used for performance enhancement by caching data and objects in memory [2]. We use the memaslap benchmark tool to evaluate the performance of Memcached. Memaslap is configured to get and set data with a 9:1 ratio 1,000,000 times, and the size is 64B with the 64 multi-get option. Figure 16 shows the evaluation results of Memcached, and the y-axis denotes the throughput normalized to the pthread mutex. The concurrency denotes a configuration value for memaslap and indicates how many requests are executed in parallel. When the number of threads is less than 4, the performance of ParaTM is similar to that of the pthread mutex. However, ParaTM outperforms the pthread mutex by up to 1.5 times. In the case of 4 threads, the throughput is proportional to the concurrency. For more than 4 threads, the performance is saturated.
V. RELATED WORK AND DISCUSSION

A. PERFORMANCE ENHANCEMENT USING TSX
Yoo et al. evaluated the first hardware implementation of TSX using a set of HPC workloads and demonstrated that applying TSX to these workloads could provide significant performance improvements [30]. Another study noted that the simple adoption of TSX failed to overcome the hardware limitations [22]. However, its authors suggested that applying TSX to lock-free algorithms could be effective. Lock-free algorithms use single-word atomic operations, but applying TSX allows multi-word atomic operations. TSX was applied to specific situations and showed performance improvement. Wei et al. proposed DrTM, a fast in-memory transaction processing system that exploits TSX together with RDMA to implement a fast distributed transaction manager [29]. In this work, TSX provided local isolation, while RDMA (which aborts transactions) provided remote isolation. Odaira et al. implemented TSX loop speculation and achieved 1.9- to 4.4-fold speedups in the NPB programs [25]. Dice et al. [10] described how to use hardware transactional memory (HTM) to simplify work stealing schedulers. Kuszmaul [20] proposed the scalable SuperMalloc, an implementation of malloc(3) originally designed for x86 HTM, and reported good performance combined with relatively simple code.
These related studies focused on maximizing the performance of workloads and customized each application. Our study is distinctive in that it focuses on how to easily apply TSX in various environments and workloads without modification, with ParaTM providing reasonable performance enhancement. Although Yoo et al. proposed TSX support for Java synchronized sections for transparent adoption of TSX [31], it is limited to the Java language.
B. DATABASE SYSTEMS WITH TSX
Many studies to enhance performance using TSX are for database systems due to their high parallelism and performance drops, which are incurred by data synchronization. Leis et al. demonstrated performance improvement by replacing the two-phase locking scheme used in relational databases with TSX [21] . However, the size of data in one transaction is frequently large in relational database systems, and therefore aborts related to TSX can easily occur. They resolved this problem by distributing the transactions. Similarly, Wang et al. [28] proposed the DBX and showed large performance gains on TPC-C style workloads. Karnagel et al. [18] adopted TSX and improved the performance of the B+Tree index and delta tree data structures, which are used for the SAP HANA database system. Diegues and Romano [11] used an automatic tuner for transactions with TSX and achieved a performance improvement twice as large for a key-value store. All these studies enhanced the performance of database systems by specifically optimizing the usage of TSX. ParaTM is designed to provide transparent embedding and adoption of TSX to all kinds of existing applications.
C. LIMITATION AND DISCUSSION
ParaTM will be useful for the adoption of TSX to traditional applications. However, the current implementation of ParaTM is targeted toward the pthread mutex mechanism. By hooking the pthread_mutex_lock/unlock function, ParaTM enables TSX and locks/unlocks a critical section. Although we implemented ParaTM for the pthread mutex mechanism, the mechanism of ParaTM can be easily ported to other lock mechanisms by changing the hooking functions.
Another limitation of ParaTM concerns further optimization. Transparent embedding of ParaTM replaces all the pthread mutex functions. However, as described in Section III and shown in the LevelDB fillrand case in Section IV, there are several cases where TSX degrades the overall performance. To address these, ParaTM provides the option to choose whether to apply TSX in each case. Although this requires a slight modification (one line of code) of the application, it may not be too big a burden for developers.
VI. CONCLUSION
Recently, hardware transactional memory has become easy to utilize due to the release of TSX. In this paper, we compare the performance of TSX with traditional lock-based algorithms using a micro-benchmark, and find that a naive implementation of TSX shows worse performance than the pthread mutex. We also show that transactional aborts are tightly coupled with performance. Thus, it is advantageous not to use TSX in critical sections where TSX aborts are always or frequently encountered. Moreover, applying TSX to traditional applications incurs a large re-engineering cost. To this end, we propose ParaTM, which can transparently adopt TSX without the modification of applications. ParaTM allows developers to easily adopt TSX features for their applications while also providing the option not to apply TSX for each critical section. Through evaluations, we confirmed that ParaTM shows reasonable performance enhancement of traditional applications without any modification. ParaTM achieved 1.75x, 4.76x, and 1.53x better performance for LevelDB, RocksDB, and Memcached, respectively.
As future work, we plan to make ParaTM more efficient. First, we need to consider the retry policy. Diegues and Romano [11] introduced a new retry policy to optimize TSX. The retry policy can be an important factor for the performance of TSX. Second, adaptive synchronization mechanisms should be considered. The current ParaTM uses a simple spinlock mechanism for the fallback path of TSX, but several alternative synchronization mechanisms can be applied [8] . Kleen also mentioned the importance of adaptive elision for future work [19] .
