Abstract: Hardware signatures for Transactional Memory (TM) systems have been proposed as an efficient mechanism for conflict detection, an essential element of TM for maintaining correctness. A signature misses no conflicts, but it can falsely declare a conflict when no true conflict exists (a false positive). In this paper, we show that some false positives can help performance by triggering the early abort of a transaction that would encounter a true conflict later anyway. We propose an adaptive grain signature that improves TM performance by dynamically changing the range of address keys based on the abort history. With architecture-level simulation and a Verilog HDL implementation, we demonstrate that a TM system with our design frequently outperforms baseline TM systems, with marginal area overhead.
I. INTRODUCTION
With the paradigm shift to Chip Multi-Processors (CMPs), Transactional Memory (TM) has been considered a promising paradigm for addressing the difficulty of parallel programming [1]. With TM, programmers focus only on the correctness of their parallel program by defining transactions, code blocks that execute atomically and in isolation. The role of extracting performance is handed over to the underlying TM system. Consequently, TM makes parallel programming much easier and yields better programmer productivity than lock-based synchronization.
A TM system is responsible for concurrency control when multiple transactions execute in parallel [2]. Each transaction is either executed completely (commit) or discarded with no effect (abort).
A conflict occurs when two or more transactions operate concurrently on the same data and at least one of them writes new data. To maintain correctness, every conflict must be detected and properly resolved. To detect conflicts, the TM system needs to track the data read (read-set) or written (write-set) by each transaction. Hardware Transactional Memory (HTM) systems have detected conflicts with speculative-read and speculative-write bits by augmenting the cache tag arrays [3]. Recently, hardware signatures have been proposed as an efficient alternative to tag augmentation [4]. A signature, based on the hardware Bloom filter [5], is an area-efficient structure for concisely representing a set of addresses.
After beginning a transaction, each thread tracks its read- and write-sets by inserting the memory addresses of its loads and stores into its read- and write-signatures, respectively. During an insertion, the block address of the memory access is hashed by a given hash function to calculate a signature index, and the corresponding bit in the signature is set. Figure 1 illustrates a bit-extraction hash function, which selects a pre-defined bit-field (the address key) from a block address as the signature index. To spread frequently occurring patterns over all indexes, more effective hash functions such as bit-permutation hashing [4], H3 hashing [6], or XOR-based hashing [7] can be used. Each hash function sets one signature bit per address. When a coherence request is received, the read- and/or write-signatures are tested to check whether a conflict occurs. The signature is accessed with the hashed request address, and it declares a conflict if all the signature bits corresponding to the requesting address are set. In our baseline HTM system, LogTM [3], the receiver sends a negative acknowledgement (nack) to the requester after detecting a conflict. The requester decides whether to stall the running transaction or abort it, based on its conflict resolution policy. The signature is cleared when the transaction commits or aborts.
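The insert, test, and clear operations described above can be sketched in a few lines of Python. This is a minimal software model, not the hardware design: the class name `BitExtractSignature`, the 1024-bit size, and the 64-byte block size are assumptions chosen for illustration.

```python
class BitExtractSignature:
    """Single-hash signature: the index is a low-order bit-field of the block address."""

    def __init__(self, num_bits=1024, block_offset_bits=6):
        # Assumed parameters: 1024-bit signature, 64-byte cache blocks.
        assert num_bits & (num_bits - 1) == 0, "size must be a power of two"
        self.num_bits = num_bits
        self.block_offset_bits = block_offset_bits
        self.bits = 0  # a Python int used as a bit vector

    def _index(self, addr):
        block_addr = addr >> self.block_offset_bits
        return block_addr & (self.num_bits - 1)  # bit-extraction address key

    def insert(self, addr):
        self.bits |= 1 << self._index(addr)

    def test(self, addr):
        return bool((self.bits >> self._index(addr)) & 1)

    def clear(self):
        self.bits = 0


write_sig = BitExtractSignature()
write_sig.insert(0x1000)              # transactional store
hit = write_sig.test(0x1000)          # coherence request to the same block
miss = write_sig.test(0x2040)         # request that maps to a different index
write_sig.clear()                     # transaction commits or aborts
after_clear = write_sig.test(0x1000)
```

A positive `test` result corresponds to the point at which the receiver would send a nack; the stall-or-abort decision of the requester is not modelled here.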
A signature can represent an unbounded number of addresses with a fixed amount of hardware, and it does not permit deleting inserted addresses until the running transaction finishes. Consequently, conflicts are never missed (no false negatives). However, a signature can signal a conflict even when no actual conflict exists (a false positive). The causes of false positives can be classified into aliasing and occupancy [6], both illustrated in Figure 1. When the addresses x and y are different but their address keys are the same, the test results in a false positive due to aliasing (Figure 1(a)). A false positive due to occupancy occurs if every signature bit for the tested address (y) has already been set by previously inserted addresses (x1 and x2), even though the tested address itself was never inserted (Figure 1(b)).
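The two causes of false positives can be reproduced with a toy model. The sketch below is illustrative only: the 4-bit hash functions `h0` and `h1`, the class `TinySignature`, and the example block addresses are assumptions, and the `inserted` set exists purely as ground truth for classifying a hit as false.

```python
def h0(block):
    return block & 0xF            # low 4-bit field (assumed hash)


def h1(block):
    return (block >> 4) & 0xF     # next 4-bit field (assumed hash)


class TinySignature:
    def __init__(self, hashes):
        self.hashes = hashes
        self.banks = [set() for _ in hashes]  # one bank of set bits per hash
        self.inserted = set()                 # ground truth, not part of a real signature

    def insert(self, block):
        self.inserted.add(block)
        for bank, h in zip(self.banks, self.hashes):
            bank.add(h(block))

    def test(self, block):
        return all(h(block) in bank for bank, h in zip(self.banks, self.hashes))

    def is_false_positive(self, block):
        return self.test(block) and block not in self.inserted


# Aliasing: x and y differ but share the same address key under a single hash.
one_hash = TinySignature([h0])
one_hash.insert(0x12)                              # x
aliasing = one_hash.is_false_positive(0x22)        # y, same low field

# Occupancy: every bit of y was set by other addresses x1 and x2.
two_hash = TinySignature([h0, h1])
two_hash.insert(0x12)                              # x1 sets h0 = 2, h1 = 1
two_hash.insert(0x35)                              # x2 sets h0 = 5, h1 = 3
occupancy = two_hash.is_false_positive(0x32)       # y needs h0 = 2, h1 = 3
no_conflict = two_hash.test(0x22)                  # h1 = 2 was never set
```

Note that the second hash removes the aliasing case from the first example (`0x22` no longer hits) while introducing the possibility of occupancy.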
False positives cause a running transaction to stall or abort. Stalling due to false conflicts can degrade performance by limiting concurrency. Even worse, aborting due to false conflicts can severely degrade the performance of an eager version management TM system such as LogTM [3]. In the LogTM system, old data that are overwritten by transactional stores are saved into an undo-log, and new data are stored in the memory hierarchy. If the transaction commits, the logged old data are discarded. However, if the transaction aborts, the logged old data must be restored to their original locations to maintain correctness. A software handler is responsible for this log unrolling, and the resulting performance penalty is non-negligible. To limit the number of transaction aborts, a requester that receives a nack stalls for a while instead of aborting immediately, and then resends the nacked request in the hope that the conflicting transaction has finished. If the conflict persists and a deadlock situation occurs (i.e., the younger transaction receives a nack from an older transaction after it has already sent a nack to another older transaction), the younger of the conflicting transactions is aborted.
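The cost asymmetry between commit and abort under eager version management can be seen in a small model of an undo-log. This is a simplified sketch, not LogTM itself: the class `EagerTx`, the dictionary standing in for memory, and logging on every store are assumptions made for illustration.

```python
class EagerTx:
    """Eager version management: new values go to memory, old values go to an undo-log."""

    def __init__(self, memory):
        self.memory = memory      # dict standing in for the memory hierarchy
        self.undo_log = []        # list of (address, old value) pairs

    def store(self, addr, value):
        # Assumption: every store is logged; None marks a previously absent address.
        self.undo_log.append((addr, self.memory.get(addr)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()     # cheap: old data are simply discarded

    def abort(self):
        restored = 0
        while self.undo_log:      # unroll in reverse order of the stores
            addr, old = self.undo_log.pop()
            if old is None:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
            restored += 1
        return restored           # abort work grows with the log length


memory = {0x10: 1, 0x20: 2}
tx = EagerTx(memory)
tx.store(0x10, 99)
tx.store(0x10, 100)
tx.store(0x20, 7)
restored = tx.abort()
```

Because the abort work is proportional to the number of logged entries, stopping a doomed transaction earlier leaves a shorter log to unroll.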
Several signature designs have been proposed to enhance area-efficiency or to improve TM performance [6, 7]. All of these schemes assume that false positives are detrimental to performance, so they attempt to reduce the total number of false positives. However, we observe that some false positives can help performance: the signature detects a conflict erroneously, but the running transaction would have incurred a conflict anyway. This false but early conflict detection can improve performance by stopping the execution of a transaction that will eventually encounter a conflict. Based on this observation, we propose a novel signature design, Adaptive Grain Signature (AGSig), which improves performance by increasing the number of performance-friendly false positives as well as decreasing the number of performance-destructive false positives. The adaptive grain signature scheme maintains a history of transaction aborts and, based on that history, dynamically changes the range of the address bit-field used for calculating signature indexes. The performance of the proposed signature design is compared to that of Page-Block XOR (PBX) signatures [7] with architecture-level simulation. AGSig and PBX signatures are also implemented in Verilog HDL to estimate delay, power, and area overhead, targeting TSMC 65 nm technology. Through simulation experiments, we demonstrate that a TM system with our design achieves speedups of up to 75.07%, with an average of 17.19%, over baseline TM systems, with marginal area overhead.
II. ADAPTIVE GRAIN SIGNATURE

A. Good False Positives vs. Bad False Positives
We categorize false positives into bad positives and good positives. A false positive that is destructive to TM performance is called a bad positive: the signature declares a conflict between concurrent transactions that would never actually conflict. However, false positives are not always harmful to performance. A good, or early, positive occurs when a signature declares a false positive and the running transaction would later incur a true positive anyway. Even though the signature fails to detect the conflict correctly, this early conflict detection can improve performance by stopping the execution of an ultimately conflicting transaction.
In eager version management TM systems, transaction aborts are expensive because an aborted transaction has to be recovered by writing old data back to their original locations. By declaring conflicts early, we can reduce the amount of data that must be restored upon abort. Early declaration of conflicts can also resolve some deadlock situations.
The notion of good positives exists mainly because of memory locality. Figure 2 shows the possible conflict scenarios in eager version management TM systems. In this figure, we assume that transactions TX1 and TX2 are older than TX3. Figure 2(a) shows the occurrence of conflicts without false positives. After nacking TX2, TX3 keeps executing until it receives a nack from the older transaction TX1. Because a deadlock situation occurs, TX3 must be aborted. In Figure 2(b), a false but early positive makes TX3 stall instead. Because of this good positive, the execution of TX2 can proceed and the deadlock situation can be resolved. Even if TX3 must be aborted anyway, stopping its execution early can speed up the abort process by reducing the amount of old data logged by TX3. If conflicting transactions exhibit spatial locality among their data accesses (memory accesses b and i in Figure 2), we can harvest good positives with coarser-grain signatures. A more detailed analysis of the good positive distribution for the STAMP benchmarks [8] can be found in [9].
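The link between spatial locality and coarser-grain signatures can be illustrated with a short calculation. The function `address_key`, the 10-bit index width, the shift amounts, and the two block addresses standing in for accesses b and i are all assumptions for illustration.

```python
INDEX_BITS = 10  # assumed signature index width


def address_key(block_addr, grain_shift):
    """Select an INDEX_BITS-wide bit-field starting at bit `grain_shift`."""
    return (block_addr >> grain_shift) & ((1 << INDEX_BITS) - 1)


block_b = 0x4A0   # assumed block touched by one transaction
block_i = 0x4A3   # assumed neighbouring block touched by the other

# Fine grain: neighbouring blocks get distinct keys, so no early positive.
fine_match = address_key(block_b, 0) == address_key(block_i, 0)

# Coarser grain: a 4-block region shares one key, so the nearby access hits.
coarse_match = address_key(block_b, 2) == address_key(block_i, 2)
```

Under these assumptions the coarse key makes an access to a nearby block hit in the signature, which is a good positive when the two transactions go on to touch the same block, and a bad positive otherwise.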
B. Proposed Signature Design
To improve TM performance by exploiting good positives and reducing bad positives, we propose an Adaptive Grain Signature (AGSig) mechanism. AGSig adds a multiplexer, a small register (SEL), and an Abort History Table (AHT) to ordinary signatures. The AHT contains the starting instruction address (program counter, PC) of each transaction that has aborted others. Based on the AHT information, the range of the address bit-field used as the address key is decided for each transaction, and the multiplexer selects the resulting range. During the execution of a transaction, the same address bit-field is used for every insert and test operation. Figure 3(a) illustrates a simple version of the adaptive grain signature (AGSig-S). Initially, the AHT is empty, and all transactions start with fine-grain signatures (i.e., the least significant bit-field of the block address is used as the address key). If a transaction aborts other transactions, its starting PC is recorded in the AHT. If the processor encounters that transaction again (AHT hit), the output of the AHT is saved in the SEL register and that value is used as the select signal of the multiplexer (i.e., the coarse-grain address field is selected). Figure 3(b) illustrates a more complex adaptive grain signature (AGSig-C) scheme. AGSig-C is similar to AGSig-S except that each AHT entry has a saturating counter. Initially, every counter value is zero. If a transaction aborts others, the starting PC of that transaction is recorded in an AHT entry, and its counter value is set to one. Whenever that transaction conflicts again, the corresponding counter is incremented. This counter value is used as the select signal of the multiplexer.
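The interaction between the AHT, the SEL register, and the multiplexer can be modelled in software as follows. This is a behavioural sketch under stated assumptions, not the Verilog design: the class `AdaptiveGrainSignature` is hypothetical, the multiplexer is modelled as shifting the selected bit-field up by `sel` bits, a single-hash signature is used, and counter updates are triggered by an explicit `record_abort_of_other` call.

```python
from collections import OrderedDict


class AdaptiveGrainSignature:
    def __init__(self, index_bits=10, max_sel=3, aht_entries=4):
        self.mask = (1 << index_bits) - 1
        self.max_sel = max_sel          # 1 behaves like AGSig-S, >1 like AGSig-C
        self.aht_entries = aht_entries
        self.aht = OrderedDict()        # starting PC -> saturating counter
        self.sel = 0                    # SEL register, latched at transaction begin
        self.bits = 0
        self.start_pc = None

    def begin_tx(self, start_pc):
        self.start_pc = start_pc
        self.sel = self.aht.get(start_pc, 0)  # AHT miss selects the fine grain
        self.bits = 0

    def record_abort_of_other(self):
        """The running transaction has aborted another one."""
        pc = self.start_pc
        if pc in self.aht:
            self.aht[pc] = min(self.aht[pc] + 1, self.max_sel)
        else:
            if len(self.aht) >= self.aht_entries:
                self.aht.popitem(last=False)  # evict the oldest entry (FIFO)
            self.aht[pc] = 1

    def _index(self, block_addr):
        # Assumed multiplexer behaviour: SEL shifts the selected bit-field.
        return (block_addr >> self.sel) & self.mask

    def insert(self, block_addr):
        self.bits |= 1 << self._index(block_addr)

    def test(self, block_addr):
        return bool((self.bits >> self._index(block_addr)) & 1)


sig = AdaptiveGrainSignature()
sig.begin_tx(0x400)                 # first encounter: fine grain
sig.insert(0x4A0)
fine_hit = sig.test(0x4A1)          # neighbouring block, different key
sig.record_abort_of_other()         # this transaction killed another one
sig.begin_tx(0x400)                 # AHT hit: coarser grain for the whole run
sig.insert(0x4A0)
coarse_hit = sig.test(0x4A1)        # neighbouring block now shares the key
```

SEL is latched only in `begin_tx`, reflecting the requirement that one transaction uses the same address bit-field for all of its insert and test operations.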
The quality of the hash function directly influences the performance of a signature. Even though any kind of hash function can be adopted in AGSig, the hardware overhead must be considered. The proposed signature is implemented as a parallel Bloom filter with two hash functions. We implement an XOR-based hash function for the first one (h1) to increase randomness. Because the randomness of lower-order address bits is sufficient [7], we use a simple bit-selection hash function for the second one (h0).
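A two-hash parallel Bloom filter of this kind can be sketched as two independently indexed banks. The exact bit-fields of the XOR hash are not given here, so the XOR of two adjacent 10-bit fields in `h1`, the index width, and the class `ParallelBloom` are assumptions.

```python
INDEX_BITS = 10                       # assumed index width per bank
MASK = (1 << INDEX_BITS) - 1


def h0(block):
    return block & MASK               # bit selection on the low-order bits


def h1(block):
    # Assumed XOR hash: fold the next-higher field onto the low field.
    return (block ^ (block >> INDEX_BITS)) & MASK


class ParallelBloom:
    """Each hash function indexes its own bank; a hit needs both bits set."""

    def __init__(self):
        self.banks = [0, 0]

    def insert(self, block):
        self.banks[0] |= 1 << h0(block)
        self.banks[1] |= 1 << h1(block)

    def test(self, block):
        bit0 = (self.banks[0] >> h0(block)) & 1
        bit1 = (self.banks[1] >> h1(block)) & 1
        return bool(bit0 and bit1)


bloom = ParallelBloom()
bloom.insert(0x12345)
same = bloom.test(0x12345)            # inserted address always hits
other = bloom.test(0x12745)           # same h0 index, different h1 index
```

In this sketch the second address aliases with the first under bit selection alone, and the XOR bank is what filters it out.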
The AHT contains the starting PCs of transactions that have aborted others. We detect which transactions abort other transactions by using abort messages, in a manner similar to direct messaging [10]. When a transaction aborts, it informs the killer transaction by sending an abort message. The receiver of this message records the starting PC of its running transaction in its AHT. The number of entries in the AHT is also an important design parameter that affects both TM performance and hardware overhead. Even though the number of aborts could be very large, the number of AHT entries can be small because the AHT tracks only the starting PC of each transaction. We use a 4-entry FIFO as our AHT.
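The abort-message flow and the 4-entry FIFO can be modelled with a bounded deque. The class `Core`, the helper `abort_victim`, the duplicate check before insertion, and the example PC values are assumptions for illustration.

```python
from collections import deque


class Core:
    def __init__(self):
        self.current_start_pc = None
        self.aht = deque(maxlen=4)    # 4-entry FIFO: appending evicts the oldest

    def begin_tx(self, start_pc):
        self.current_start_pc = start_pc

    def on_abort_message(self):
        """A victim reports that this core's running transaction aborted it."""
        pc = self.current_start_pc
        # Assumption: a PC already present in the AHT is not inserted again.
        if pc is not None and pc not in self.aht:
            self.aht.append(pc)

    def aht_hit(self, start_pc):
        return start_pc in self.aht


def abort_victim(victim, killer):
    victim.current_start_pc = None    # victim rolls back (log unrolling not modelled)
    killer.on_abort_message()         # victim notifies the killer


killer, victim = Core(), Core()
for pc in (0x100, 0x200, 0x300, 0x400, 0x500):
    killer.begin_tx(pc)
    victim.begin_tx(0x900)
    abort_victim(victim, killer)

oldest_evicted = not killer.aht_hit(0x100)
newest_present = killer.aht_hit(0x500)
```

With five distinct killer transactions, the first recorded PC falls out of the table, which shows the capacity trade-off discussed above.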
After a killer transaction aborts another transaction, its starting PC is recorded in the AHT. AGSig switches to a coarser granularity when the killer transaction is encountered again. AGSig improves TM performance in two ways. First, changing the granularity to a coarser level (a coarser bit-field) increases the chance of detecting good positives because of locality. By stalling transactions that would eventually be aborted, we reduce the overhead of saving old data into an undo-log and restoring old data from it. Second, using a different address bit-field to access the signatures after a conflict has been detected can reduce the false conflicts (bad positives) that occurred under the original bit-field. Therefore, the adaptive grain signature design increases the number of good positives and decreases the number of bad positives.
III. PERFORMANCE EVALUATION
We implement the proposed signature designs on the LogTM system provided by the GEMS/Simics simulator [11]. Table I summarizes the main system parameters. We model a 16-core CMP system. Each core has private L1 I- and D-caches, and a 16-bank, unified L2 cache is shared by all cores. A TM system with PBX signatures is used as the baseline against which our AGSig scheme is evaluated. The sizes of both the AGSig and PBX signatures range from 2K bits to 32K bits. The STAMP benchmarks [8] are used for performance evaluation. Figure 4 shows the speedup of AGSig over PBX. In most cases, AGSig-C performs better than PBX or AGSig-S. The maximum speedup of a TM system with AGSig-C is 75.07%, with an average of 17.19%, over a TM system using PBX. A more detailed performance analysis can be found elsewhere [9].
IV. SYNTHESIS RESULTS
This section describes the delay, power, and area requirements of the signatures. The AGSig and PBX signatures are written in Verilog HDL and synthesized using Synopsys Design Compiler, targeting a TSMC 65 nm technology. CACTI 5.3 [12] is used to evaluate the Bloom filters, which are implemented as SRAMs with one read port and one write port, targeting 65 nm technology. Because of a limitation of CACTI, we evaluate only signatures of 2K bits or larger. Table II shows the delay, power consumption, and area requirements of the AGSig and PBX signatures. Because of its additional hardware, such as the AHT, AGSig exhibits a larger hardware overhead than the PBX signature in all cases. However, as the signature size increases, the difference between AGSig and PBX becomes negligible.
The area overheads of signatures should be considered in the context of the whole processor area. We compare signature areas relative to the area of the Sun Rock processor [13], which is fabricated in Texas Instruments 65 nm technology. The Rock's core area is 14 mm² and it supports four hardware threads. We assume that each thread has read- and write-signatures. Table III shows the area overheads of the signatures. Relative to the processor area, the difference in area requirements between AGSig and PBX is largely insignificant.
V. CONCLUSIONS
In this paper, we showed that some false positives in transactional memory systems can be beneficial to performance by stopping early the execution of a transaction that will eventually encounter a conflict. Based on this observation, we presented an Adaptive Grain Signature scheme, AGSig, which improves performance by dynamically changing the range of the address key based on the abort history. With AGSig, we can increase the number of performance-friendly false positives as well as decrease the number of performance-destructive false positives. A TM system with the proposed signature achieves speedups of up to 75.07%, with an average of 17.19%, over a TM system with PBX signatures, with less than a 5% area increase for large signature sizes.
