Abstract: Transactional Memory (TM) systems must track memory accesses made by concurrent transactions in order to detect conflicts. Many TM implementations use signatures for this purpose, which summarize reads and writes in fixed-size bit registers at the cost of false positives (detection of nonexisting conflicts). Signatures are commonly implemented as two separate same-sized Bloom filters, one for reads and the other for writes. In contrast, transactions frequently exhibit read and write sets of uneven cardinality. This mismatch between data sets and filter storage introduces inefficiencies in the use of signatures that have some impact on performance. This paper presents different signature designs as alternatives to the common scheme to deal with the asymmetry in transactional data sets in an effective way. We analyze two classes of new signatures, called multiset and reconfigurable asymmetric signatures. The first class uses only one Bloom filter to track both read and write sets, while the second class uses Bloom filters of configurable size for reads and writes. The main focus of this paper is a thorough study of these alternative signature designs, including a statistical analysis of false positives and an experimental evaluation, providing performance results and hardware area, time, and energy requirements.
At the beginning of the past decade, chip manufacturers started to turn to single-chip parallel processors (CMPs) [1], due to power, memory and ILP constraints of single-core architectures. CMPs include multiple processor cores with a shared-memory internal architecture. Today, multicore processors have become mainstream, and have quickly made multithreaded parallel programming widespread.
In general, multithreaded programming is a challenging task which makes it difficult to exploit multicore processors. Parallelism introduces nondeterminism that must be controlled by a careful design of the computational threads and their coordination through explicit synchronization. Thus, mutual exclusion mechanisms must ensure correct concurrent access to shared data. Low-level primitives like locks have been traditionally used for that purpose. However, as locks serialize multithreaded execution, most expert programmers resort to fine-grained locking to improve performance. This adds complexity to parallel programming and requires great efforts to achieve both high performance and deadlock avoidance. Also, locks have other disadvantages difficult to solve, like convoying or priority inversion [2] , as well as ineffective mechanisms for abstraction and composition.
Nonexpert parallel programmers, though, seek productivity and performance at low programming complexity, which has caused a great interest in proposing alternative models to lock-based multithreaded programming. Transactional Memory (TM) [2] , [3] represents an alternative that inherits the concept of transaction from the database field, with the aim of easing the writing of concurrent programs. A transaction is a block of computations that appears to be executed in an atomic and isolated way. TM systems execute transactions in parallel committing nonconflicting ones. A conflict occurs when a memory location is concurrently accessed by two or more transactions, and at least one access is a write.
TM systems can be classified into software (STM) and hardware (HTM) systems, as well as hybrid and hardware accelerated implementations. In this paper, the interest lies in hardware implementations of TM, which include those systems that provide most of the required TM mechanisms implemented in hardware at the core level [4] , [5] , [6] , [7] , [8] , [9] , [10] , [11] , as well as those systems that provide hardware support to speed up parts of STM systems [12] , [13] , [14] .
The systems above must track all data read and written by each transaction in order to detect data races (conflicts) amongst them. Bloom filters [15] were proposed to summarize transactional accesses into two fixed-size bit registers, called signatures, at the cost of false positives (detection of nonexisting conflicts). Such two signatures store, respectively, memory addresses that are read (read set-RS) and written (write set-WS) inside transactions. Some TM proposals that include signatures are FlexTM [14] , BulkSC [16] , LogTM-SE [17] , SigTM [18] , STMlite [19] (software signatures), and DynTM [20] .
Read and write signatures are usually implemented as two separate, same-sized Bloom filters. In contrast, transactions frequently exhibit read and write sets of uneven cardinality. In addition, both sets are not disjoint, as data can be read and also written. This mismatch between data sets and hardware storage introduces inefficiencies in the use of signatures that have some impact on performance, as, for example, read signatures may populate earlier than write ones, increasing the expected false positive rate. This paper presents different signature designs as alternatives to the common scheme to deal with asymmetry in transactional data sets in an effective way. Basically, we analyzed two classes of new signatures: multiset and reconfigurable asymmetric signatures. The first class uses a single Bloom filter to track both read and write sets. Different alternatives were studied to take advantage of some important properties of data access patterns, like the significant amount of transactional memory locations that are both read and written, or the locality of reference. The second class uses Bloom filters of configurable size for reads and writes (a static approach was first discussed in [18] where the sensitivity to signature length was analyzed). A preliminary study of the first class of signatures can be found in [21] . We extend here such study by including a thorough analysis of the alternative signature designs above, along with a false positive probability analysis and a complete experimental evaluation comparing the different approaches in terms of performance, hardware area, time, and energy requirements.
The rest of the paper is organized as follows: Next section presents a background on signatures, their design and implementation, with a brief review of the related work. Section 3 introduces the proposed signature schemes, multiset, and reconfigurable asymmetric, discussing their concept and implementation, and showing a comparison with the common designs based on separate filters. Section 4 shows a statistical analysis of the proposed signatures, determining false positive rates in different contexts. Section 5 presents experimental results for the multiset and reconfigurable asymmetric signatures on the GEMS [22] simulator using the STAMP [23] workloads, and compares the performance attained by the different design alternatives. Besides, an analysis of area, time, and energy requirements using CACTI [24] is also shown. Finally, Section 6 concludes the paper.
BACKGROUND AND RELATED WORK
Ceze et al. [6] first proposed the implementation of hardware signatures by using per-thread Bloom filters. Such filters allow membership queries of elements over a set in a time and space-efficient way. Insertions of an unlimited number of elements can be performed at the cost of false positives (i.e., detection of nonexisting conflicts due to address aliasing). As elements can be added to the set but not removed, the filter is false negative free. A Bloom filter is a data structure which includes a bit array and k independent hash functions mapping the elements into k randomly distributed bits of the array. Initially, the array is zeroed. When an element is to be inserted into the Bloom filter, the k bits indexed by the hash functions are asserted. A positive membership query requires that those k bits are all set to 1.
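The insert and query operations just described can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware design: the class name `BloomFilter` is hypothetical, and salted BLAKE2b is used as a stand-in for the k independent hash functions (an assumption; hardware signatures use much cheaper hashes).

```python
import hashlib


class BloomFilter:
    """Regular Bloom filter: one 2**m-bit array indexed by k hash functions."""

    def __init__(self, m: int, k: int):
        self.size = 1 << m
        self.k = k
        self.bits = [0] * self.size  # the array starts zeroed

    def _indexes(self, addr: int):
        # Stand-in hash functions (assumption): one BLAKE2b salt per function.
        for i in range(self.k):
            digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=8,
                                     salt=i.to_bytes(2, "little")).digest()
            yield int.from_bytes(digest, "little") % self.size

    def insert(self, addr: int) -> None:
        for idx in self._indexes(addr):
            self.bits[idx] = 1  # assert the k indexed bits

    def query(self, addr: int) -> bool:
        # Positive only if all k indexed bits are 1 (may be a false positive).
        return all(self.bits[idx] for idx in self._indexes(addr))


bf = BloomFilter(m=10, k=4)
for a in (0x1000, 0x2040, 0x3080):
    bf.insert(a)
```

Because bits are only ever set, a query for an inserted address always succeeds, which is the false-negative-free property mentioned above.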
As a hardware-efficient alternative to regular Bloom filters, Sanchez et al. [25] proposed the parallel Bloom filter implementation. Unlike the regular filter, which is implemented as a k-ported SRAM, the parallel one uses k single-ported SRAMs, while its false positive rate is similar or even better. Also, Sanchez et al. concluded that the H3 class of hash functions [26] should be used instead of bit-selection ones [27], as H3 exhibits better distribution features. Nevertheless, the hardware cost of H3 is higher, as an XOR tree per hash bit is needed.
Page-Block-XOR hashing (PBX) is a lower hardware cost implementation of H3 hash functions proposed by Yen et al. [28] . They use the concept of entropy to find the address bits which exhibit more randomness to use them as the inputs to the hash functions. Notary is also proposed in [28] as a technique to reduce the number of asserted bits in the signature based on privatization strategies. This way, only the shared addresses are recorded in the signature. Notary requires support from the compiler, runtime/library, and operating system levels. In addition, the programmer must allocate objects as either private or shared.
Recently, Choi and Draper [29] proposed adaptive grain signatures, which dynamically change the input bit range to the hash functions based on transaction abort history. The aim of this design is to reduce the number of false positives that harm execution performance. In a more recent work [30], Choi and Draper proposed a unified signature design that merges read and write signatures into a single one. Two variants of unified signatures were shown: one that tracks both read and write accesses, and another that includes a helper signature to filter out read-read conflicts. The multiset signatures proposed here also consider the basic unification of read and write filters, but this baseline design is optimized by a different strategy, based on partially sharing read and write hash functions. Both works are complementary.
MULTISET and ASYMMETRIC SIGNATURES
This section presents the multiset (MS) and reconfigurable asymmetric (ASYM) signature proposals and their implementation, both regular and parallel, as alternatives to separate (SEP) schemes. Fig. 1a shows the implementation of a regular SEP signature. It comprises two Bloom filters, one for keeping track of the addresses read by a transaction and the other one for the addresses written. Such filters can be implemented as SRAMs of $2^m$ bits. When $k > 1$, multiported SRAMs are needed to perform the operations in one cycle; $P_i$, $i \in [0, k-1]$, denotes the ports in Fig. 1. However, multiported memories require more hardware than single-ported ones, and signatures must remain both concise and fast. The signature provides four operations: inserting an address into the read set filter (Insert RS), inserting an address into the write set (Insert WS), checking for membership of an address in both the read and write sets (Check RS+WS; transactional writes must be checked this way), and checking for membership of an address in the write set (Check WS; for transactional reads). In case of insertion, the Address is mapped into $k$ indexes by the hash functions, either $h^r_i$, $i \in [0, k-1]$, for the read set or $h^w_i$ for the write set. Then, the Insert RS/WS signal enables writing in the SRAM (WE: write enable). In this case, WE enables all the ports in the SRAM. The SRAM input ports are set to 1, so the bits indexed by the hash functions are all set to 1. In case of checking for write set membership, the $k$ one-bit output data ports are ANDed together and Check WS selects the 0 input of the multiplexer, which is then connected to Match. For read set and write set checking, the AND output of the write set and the AND output of the read set are ORed, while Check RS+WS selects the 1 input of the multiplexer, which connects it to the Match output.
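As a rough software model of the four operations of a regular SEP signature, consider the following Python sketch. The names `SepSignature` and `_h` are hypothetical, salted BLAKE2b stands in for the hardware hash functions (an assumption), and both filters use the same hash functions, which is possible because the two filters are physically separate.

```python
import hashlib


def _h(addr: int, seed: int, size: int) -> int:
    """Stand-in hash (assumption): salted BLAKE2b reduced modulo the array size."""
    digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(2, "little")).digest()
    return int.from_bytes(digest, "little") % size


class SepSignature:
    """Two separate 2**m-bit Bloom filters, one for reads and one for writes."""

    def __init__(self, m: int, k: int):
        self.size, self.k = 1 << m, k
        self.rs = [0] * self.size
        self.ws = [0] * self.size

    def _indexes(self, addr: int):
        return [_h(addr, i, self.size) for i in range(self.k)]

    def insert_rs(self, addr: int) -> None:
        for idx in self._indexes(addr):
            self.rs[idx] = 1

    def insert_ws(self, addr: int) -> None:
        for idx in self._indexes(addr):
            self.ws[idx] = 1

    def check_ws(self, addr: int) -> bool:
        # Used for transactional reads: AND of the k write-filter bits.
        return all(self.ws[idx] for idx in self._indexes(addr))

    def check_rs_ws(self, addr: int) -> bool:
        # Used for transactional writes: OR of the two per-filter AND results.
        idxs = self._indexes(addr)
        return all(self.rs[i] for i in idxs) or all(self.ws[i] for i in idxs)


sig = SepSignature(m=10, k=4)
sig.insert_rs(0x40)
sig.insert_ws(0x80)
```

In this model a read-only address never matches Check WS as long as the write filter is sparsely populated, which is the isolation between sets that the multiset schemes below trade away.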
Regular Multiset Signatures
Regular MS signatures join the read set and write set filters together in a single filter of twice the size: $2^{m+1}$ bits. Fig. 1b depicts how this proposal can be implemented. The read set hash functions and the write set hash functions index the whole SRAM address range. Therefore, $2k$ ports are needed; the Insert RS signal drives the first $k$ WE inputs and the Insert WS signal drives the rest. Doubling the number of SRAM ports makes the area required by regular multiset signatures grow quadratically. In order to save area while maintaining time efficiency, parallel signatures [25], [6] are used, which do not need multiported SRAMs.
Parallel Multiset Signatures
A parallel Bloom filter comprises $k$ arrays of $2^m/k$ bits. Each hash function indexes its own array, so one bit is set in each array on insertion. Fig. 1d shows the implementation of parallel SEP signatures. Like regular filters, parallel filters can be implemented as SRAMs. However, they use several smaller single-ported SRAMs instead of a larger multiported one, thus saving hardware area. Furthermore, parallel Bloom filters have been shown to yield similar or better performance than regular ones [25], [31].
On the other hand, Fig. 1e depicts the implementation of the MS counterpart of the parallel signature. In this case, the SRAM is also partitioned into $k$ smaller arrays, but of $2^{m+1}/k$ bits each. Now, each SRAM is indexed by two hash functions, one for the read set, $h^r_i$, and the other one for the write set, $h^w_i$. Therefore, parallel MS signatures take more than twice the area of parallel SEP signatures, since parallel MS signatures need 2-ported SRAMs whereas parallel SEP signatures use single-ported SRAMs. To reduce the complexity of the MS scheme, we next propose to keep certain SRAMs single-ported.
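The banked organization of a parallel Bloom filter can be illustrated with a short Python model. This is a sketch under stated assumptions: `ParallelBloomFilter` is a hypothetical name, and salted BLAKE2b replaces the per-bank hardware hash functions.

```python
import hashlib


class ParallelBloomFilter:
    """k single-ported banks of 2**m / k bits; hash function i indexes bank i only."""

    def __init__(self, m: int, k: int):
        self.k = k
        self.bank_bits = (1 << m) // k
        self.banks = [[0] * self.bank_bits for _ in range(k)]

    def _index(self, bank: int, addr: int) -> int:
        # Stand-in hash (assumption): BLAKE2b salted with the bank number.
        digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=8,
                                 salt=bank.to_bytes(2, "little")).digest()
        return int.from_bytes(digest, "little") % self.bank_bits

    def insert(self, addr: int) -> None:
        for b in range(self.k):
            self.banks[b][self._index(b, addr)] = 1  # exactly one bit per bank

    def query(self, addr: int) -> bool:
        return all(self.banks[b][self._index(b, addr)] for b in range(self.k))


pbf = ParallelBloomFilter(m=10, k=4)
pbf.insert(0x1000)
```

Since every bank receives exactly one access per operation, each bank can be a single-ported memory, which is where the area saving comes from.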
Parallel Multiset Shared Signatures
Several memory locations are read and written inside transactions. Some of them are only read and others are only written, but many of them are both read and written. Section 5.4 shows that about 30 percent of locations are both read and written for the workloads tested. In such a case, storing the same address twice is redundant, but the filter must be able to discriminate whether the address was only read or also written. Fig. 1f shows the proposed solution. Such a signature is a parallel MS signature where $s$ SRAMs are single-ported, so they are indexed by only one hash function each, $h_0, h_1, \ldots, h_{s-1}$, with $s \in [0, k]$; i.e., these hash functions are said to be shared between RS and WS. This way, when inserting an address into the filter, some arrays do not take into account whether the address was read or written; they simply record one bit representing the address. That is why Insert RS and Insert WS are ORed to drive the WE signal of the single-ported SRAMs. However, the rest of the arrays must continue to discriminate between reads and writes, so they are addressed by a read hash function, $h^r$, and a write hash function, $h^w$. Consequently, in case of an insertion into the write set, $h^w$ would set a bit in its SRAM. Then, if the same address is subsequently inserted into the read set, a different bit would be set to 1 by $h^r$ in the same SRAM. On checking the signature, the bits from the single-ported SRAMs, which are the same for the read set and for the write set, are ANDed together and then ANDed with the bits coming out of the double-ported SRAMs, which differ depending on the port: P0 bits correspond to the read set and P1 bits correspond to the write set.
Finally, to find out the value of s, a tradeoff between hardware requirements and signature performance has to be carried out. On one hand, if s is set to k, the signature implements k single ported SRAMs. Hence, parallel MS shared signatures require hardware similar to parallel SEP signatures but parallel MS signatures do not differentiate between read and written addresses which could degrade the performance. On the other hand, if s is set to 0, MS signatures implement k double-ported SRAMs increasing the hardware requirements but maximizing the probabilities of discriminating addresses. Section 5.4 explores every possible scenario.
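The role of the parameter s can be illustrated with a small software model of a parallel MS shared signature. The sketch below is an assumption-laden illustration, not the hardware design: `ParallelMSShared` and `_h` are hypothetical names, and salted BLAKE2b stands in for the shared, read, and write hash functions.

```python
import hashlib


def _h(addr: int, seed: int, size: int) -> int:
    """Stand-in hash (assumption): salted BLAKE2b reduced modulo the bank size."""
    digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=8,
                             salt=seed.to_bytes(2, "little")).digest()
    return int.from_bytes(digest, "little") % size


class ParallelMSShared:
    """k banks of 2**(m+1) / k bits; the first s banks use one shared hash function,
    the remaining k - s banks use distinct read and write hash functions."""

    def __init__(self, m: int, k: int, s: int):
        if not 0 <= s <= k:
            raise ValueError("s must lie in [0, k]")
        self.k, self.s = k, s
        self.bank_bits = (1 << (m + 1)) // k
        self.banks = [[0] * self.bank_bits for _ in range(k)]

    def _index(self, bank: int, addr: int, is_write: bool) -> int:
        if bank < self.s:
            seed = bank                              # shared: ignores read/write
        else:
            seed = 100 + 2 * bank + int(is_write)    # separate h^r / h^w
        return _h(addr, seed, self.bank_bits)

    def insert(self, addr: int, is_write: bool) -> None:
        for b in range(self.k):
            self.banks[b][self._index(b, addr, is_write)] = 1

    def _match(self, addr: int, is_write: bool) -> bool:
        return all(self.banks[b][self._index(b, addr, is_write)]
                   for b in range(self.k))

    def check_ws(self, addr: int) -> bool:
        return self._match(addr, True)

    def check_rs_ws(self, addr: int) -> bool:
        return self._match(addr, False) or self._match(addr, True)


fully_shared = ParallelMSShared(m=10, k=4, s=4)
fully_shared.insert(0x40, is_write=False)   # a read only
```

With s = k, the read of 0x40 also matches Check WS, because no bank distinguishes reads from writes; lowering s restores discrimination at the price of more double-ported banks.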
Reconfigurable Asymmetric Signatures
We propose a reconfigurable asymmetric signature that can be configured at execution time to have a read set larger than, equal to, or smaller than the write set. The parameter $a$, the number of SRAMs devoted to the read set, is provided by the Mask Register in terms of a mask whose value can be derived from the expression $2^{2k} - 2^{2k-a}$, which stands for $a$ ones padded with $2k - a$ zeros on the right. For example, assuming $k = 4$ and a read set larger than the write set, $a$ can be either 5, 6, or 7, which means that the read set comprises 5, 6, or 7 SRAMs and the write set comprises 3, 2, or 1 SRAMs, respectively. Consequently, the masks would be the following bit words: 11111000, 11111100, and 11111110.
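The mask expression can be checked with a few lines of Python. The function name `asym_mask` is hypothetical; the range check on a assumes that at least one SRAM is kept for each set.

```python
def asym_mask(k: int, a: int) -> int:
    """Mask with a ones followed by 2k - a zeros: 2**(2k) - 2**(2k - a)."""
    if not 0 < a < 2 * k:
        raise ValueError("a must leave at least one SRAM for each set")
    return (1 << (2 * k)) - (1 << (2 * k - a))


# Reproduce the example in the text for k = 4.
masks = [format(asym_mask(4, a), "08b") for a in (5, 6, 7)]
print(masks)  # ['11111000', '11111100', '11111110']
```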
On inserting an address into the asymmetric signature, the SRAMs' WE are selected depending on the Mask Register. For Insert RS, this signal is ANDed together with the mask, because it represents the number of SRAMs in the read set. For Insert WS, this signal is ANDed with the inverse of the mask which represents the SRAMs belonging to the write set. Finally, the result of both AND gates is ORed to drive the WE signals of the SRAMs.
On checking for WS membership, the output of the $2k$ SRAMs is bitwise ORed with the mask, so that the RS SRAM output bits are ORed with a 1 and result in a 1 whatever their value. However, the WS SRAM output bits are ORed with the 0 bits of the mask, so they stay the same. Then, the $2k$ outputs of the OR gates are ANDed together, giving as a result the AND of the WS SRAM output bits, as the RS SRAM output bits have all been set to 1. To work out the RS match output, the same procedure is followed but the inverse of the mask is used.
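The mask-based gating for insertion and checking can be modeled with plain integer bit operations. This is a minimal sketch: the function names are hypothetical, and the 2k SRAM output bits are packed into one integer with the RS SRAMs in the most significant positions, matching the mask layout.

```python
def write_enables(mask: int, insert_rs: bool, insert_ws: bool, nbanks: int) -> int:
    """WE bits: (Insert RS AND mask) OR (Insert WS AND NOT mask)."""
    full = (1 << nbanks) - 1
    rs_part = mask if insert_rs else 0
    ws_part = (~mask & full) if insert_ws else 0
    return rs_part | ws_part


def match_ws(outputs: int, mask: int, nbanks: int) -> bool:
    """OR with the mask forces RS bank outputs to 1, then AND all bits."""
    full = (1 << nbanks) - 1
    return (outputs | mask) & full == full


def match_rs(outputs: int, mask: int, nbanks: int) -> bool:
    """Same procedure with the inverted mask, forcing WS bank outputs to 1."""
    full = (1 << nbanks) - 1
    return (outputs | (~mask & full)) & full == full


MASK = 0b11111000   # k = 4, a = 5: five RS banks, three WS banks
print(format(write_enables(MASK, True, False, 8), "08b"))   # 11111000
print(format(write_enables(MASK, False, True, 8), "08b"))   # 00000111
```

For instance, if only the three WS banks output a 1 (`0b00000111`), `match_ws` is true and `match_rs` is false.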
The Mask Register could be loaded with a mask by means of either an instruction in the instruction set architecture or the contention manager of the TM system. In any case, the problem lies in finding the appropriate mask to configure the signature to yield the best performance. This is not a trivial problem, because we might need feedback from the compiler, runtime, TM system or from the behavior of the signature itself. Finding the appropriate mask also depends on whether a static (per-execution) or dynamic (per-transaction) signature configuration is preferred throughout the execution of the application. This is beyond the scope of this paper, so Section 5.3 explores several per-execution configurations of the ASYM signature to give more insight in its behavior compared to MS signatures. A heuristic to choose the best per-execution configuration is also proposed.
Hash Functions
Hash functions are implemented as H3 XOR hash functions [26] which comprise a set of XOR gate trees per function. XOR gate trees do not require significant area and, moreover, they can be replaced by a single line of XOR gates by using PBX hashing [28] .
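An H3 hash function computes each output bit as the XOR (parity) of a subset of the address bits, the subset being selected by one row of a fixed binary matrix. The Python sketch below illustrates this; the helper names and the randomly generated matrix are assumptions for illustration and are not the matrices used by any particular simulator.

```python
import random


def make_h3(addr_bits: int, out_bits: int, seed: int) -> list:
    """One random addr_bits-wide row per output bit (hypothetical matrix)."""
    rng = random.Random(seed)
    return [rng.getrandbits(addr_bits) for _ in range(out_bits)]


def h3(addr: int, rows: list) -> int:
    """Each hash bit is the parity of the address bits selected by a row,
    i.e., the output of one XOR tree."""
    index = 0
    for row in rows:
        parity = bin(addr & row).count("1") & 1
        index = (index << 1) | parity
    return index


rows = make_h3(addr_bits=26, out_bits=10, seed=1)
idx = h3(0x2ABCDE, rows)   # a 10-bit index into a 1,024-entry array
```

A random row selects about half of the address bits on average, which is why the XOR tree for one hash bit needs roughly half as many gates as there are address bits.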
However, the area of the hashing logic depends on the number of hash functions $k$ and on the number of bits required to address the SRAMs, which in turn depends on whether the signature is implemented as regular, parallel, SEP, or MS. Also, half of the address bits are used per bit of the hash function on average [28], so the number of 2 fan-in XOR gates needed by an XOR tree that computes one hash bit is $b = \lceil A/2 \rceil - 1$, where $A$ is the number of address bits. The number of XOR gates, $N_{XOR}$, required by each scheme is as follows:
• Regular SEP signatures. As Fig. 1a shows, regular SEP signatures comprise $2k$ hash functions, $k$ for the read set and $k$ for the write set. However, as the read set is separated from the write set, they can use the same hash functions. Finally, the number of bits they need to index their arrays is $m$, so
$N_{XOR} = k \cdot m \cdot b. \quad (1)$
• Regular MS signatures. The MS signature joins the RS and WS SRAMs together, so its hash functions need one more output bit to index the SRAM, $m + 1$, while the hash functions must be different from each other:
$N_{XOR} = 2k \cdot (m + 1) \cdot b. \quad (2)$
• Parallel SEP signatures. Depicted in Fig. 1d, this scheme divides the two SRAMs into $k$ smaller SRAMs of $2^m/k$ bits. The number of hash functions is still $k$, but the number of index bits is $m - \log_2 k$:
$N_{XOR} = k \cdot (m - \log_2 k) \cdot b. \quad (3)$
• Parallel MS signatures. The parallel MS signature is like its SEP counterpart but requires one more bit per hash function and different hash functions for the read set and the write set:
$N_{XOR} = 2k \cdot (m + 1 - \log_2 k) \cdot b. \quad (4)$
• Parallel MS shared signatures. Shown in Fig. 1f, their hash functions yield indexes of the same length as those of parallel MS signatures. However, the number of hash functions changes in this case, because the $s$ single-ported SRAMs are only indexed by one hash function each, so the total number of hash functions is $2k - s$:
$N_{XOR} = (2k - s) \cdot (m + 1 - \log_2 k) \cdot b. \quad (5)$
• ASYM signatures. These signatures are similar to parallel SEP signatures but, in this case, the number of different hash functions is $2k - 1$, because $a$ can range from 1 to $2k - 1$:
$N_{XOR} = (2k - 1) \cdot (m - \log_2 k) \cdot b. \quad (6)$
Next, an example for real parameters is shown. For an address of 26 bits (32 − 6 bits of line address), $m = 10$ and $k = 4$, 480 XOR gates are needed for regular SEP signatures. The MS counterpart results in 1,056 gates. Parallel SEP signatures need 384 XOR gates, whereas the parallel MS one needs 864. ASYM signatures need 672. Note that MS schemes need one more bit per hash index and twice the number of hash functions of SEP signatures, because the arrays are double sized and share the filter. However, the MS shared scheme lowers the XOR gate requirements by lowering the number of hash functions. For example, the MS shared signature with $s = 1$ needs 756 XOR gates, with $s = 2$ only 648 gates, and with $s = 3$ just 540 XOR gates, quite close to those needed by parallel SEP signatures.
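The gate counts quoted in this example can be reproduced with a short script. It assumes that each scheme needs (number of hash functions) × (index bits) × b gates, with b = ⌈A/2⌉ − 1 two-input XOR gates per hash bit for an A-bit address; this form is inferred from the figures in the text, and the function name `xor_gates` is hypothetical.

```python
from math import ceil, log2


def xor_gates(num_hashes: int, index_bits: int, addr_bits: int = 26) -> int:
    """Upper bound on 2-input XOR gates: one tree of b gates per hash output bit."""
    b = ceil(addr_bits / 2) - 1   # half of the address bits feed each tree
    return num_hashes * index_bits * b


m, k = 10, 4
lk = int(log2(k))

counts = {
    "regular SEP": xor_gates(k, m),
    "regular MS": xor_gates(2 * k, m + 1),
    "parallel SEP": xor_gates(k, m - lk),
    "parallel MS": xor_gates(2 * k, m + 1 - lk),
    "ASYM": xor_gates(2 * k - 1, m - lk),
}
shared = {s: xor_gates(2 * k - s, m + 1 - lk) for s in (1, 2, 3)}

print(counts)  # 480, 1056, 384, 864, 672
print(shared)  # {1: 756, 2: 648, 3: 540}
```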
The equations above provide an upper bound for the number of XOR gates required by the hash functions. They use half of the address bits per hash bit so the hash functions share many bit pairs of the address and hence, they can also share XOR gates. In fact, in Section 5.6, we found that the real number of XOR gates is lower than that given by (1) to (6).
FALSE POSITIVE ANALYSIS
The false positive probability of a Bloom filter can be computed as a function of its occupancy, that is, the number of bits asserted, and the number of bits to be checked in a query [25], [32]. For an $M$-bit filter, if bits are asserted equiprobably by a hash function, the probability of a given bit being a 1 is $1/M$, and $1 - 1/M$ is the probability of a bit being 0. If $occ$ is the occupancy of the filter, that is, $occ$ bits have already been asserted, the probability of a bit still being 0 is $(1 - 1/M)^{occ}$, and hence, if $nchecks$ bits are going to be checked, the probability of getting a positive, i.e., all checked bits equal to 1, is
$p_P(M, occ, nchecks) = \left(1 - (1 - 1/M)^{occ}\right)^{nchecks}. \quad (7)$
A common assumption is that the probability of a false positive is approximately equal to that of getting a positive. The reason for that is Bayes' theorem:
$\Pr(FalsePositive \mid Positive) = \Pr(Positive \mid FalsePositive) \, \Pr(FalsePositive) / \Pr(Positive),$
where $\Pr(Positive \mid FalsePositive) = 1$, since every false positive is a positive.
Provided that the number of elements inserted into the filter is a small fraction of the total possible elements whose query is positive, the conditional probability is $\Pr(FalsePositive \mid Positive) \approx 1$.
In this way, $\Pr(FalsePositive) = \Pr(FalsePositive \mid Positive) \, \Pr(Positive) \approx \Pr(Positive)$.
A more usual form of (7) for a single filter of $M$ bits, after inserting a sequence of $n$ elements using $k$ hash functions, is [31], [25]
$p_{FP}(M, n, k) = p_P(M, nk, k) = \left(1 - (1 - 1/M)^{nk}\right)^{k}. \quad (8)$
The last equation assumes that elements map into different bits during insertion. So, the occupancy is the number of inserted elements, $n$, times the number of hash functions, $k$, and each query checks $k$ bits. The goal in this section is to evaluate the probability of false positives for an asymmetric/multiset configuration, such as the one shown in Fig. 2, where the Bloom filter array is split into three sections: one exclusively for RS (read section), another exclusively for WS (write section), and the last one for $RS \cup WS$ (multiset section). Let $M$ be the total number of bits of the array. Then, $M$ can be broken down into $m_r$ bits for the read section, $m_w$ bits for the write section, and $m_\cup$ bits for the multiset section, providing $M = m_r + m_w + m_\cup$.
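As a numerical aid, the positive probability for a given occupancy and its single-filter form after n insertions with k hash functions can be evaluated directly. The sketch below assumes the standard Bloom filter expressions, (1 − (1 − 1/M)^occ)^nchecks and its specialization with occ = nk and nchecks = k; the function names are hypothetical.

```python
def p_positive(M: int, occ: int, nchecks: int) -> float:
    """Probability that all nchecks queried bits are 1 after occ bit assertions."""
    return (1.0 - (1.0 - 1.0 / M) ** occ) ** nchecks


def p_false_positive(M: int, n: int, k: int) -> float:
    """Single M-bit filter after inserting n elements with k hash functions."""
    return p_positive(M, n * k, k)


# Illustrative sweep: a 1,024-bit filter with k = 4 and a growing number of insertions.
sweep = [(n, p_false_positive(1024, n, 4)) for n in (16, 64, 256)]
for n, p in sweep:
    print(f"n = {n:3d}  p_FP = {p:.4f}")
```

The probability grows quickly with the number of inserted addresses, which is why a read filter that fills earlier than the write filter dominates the overall false positive rate.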
Consider a sequence of $n = Card(RS \cup WS)$ addresses to be inserted into the filter, where $n_r = Card(RS)$, $n_w = Card(WS)$, $n_\cap = Card(RS \cap WS)$ and consequently $n = n_r + n_w - n_\cap$. The false positive rate for the multiset section is given by the next equation, as an address which has been both read and written will assert k
According to (8) the probability of getting a false positive on checking a read involves getting a positive both in the read section and the multiset section of the filter in Fig. 2 
Note that $p_{cr}$ and $p_{cw}$ are conditioned by the TM system. In our TM simulation environment, about 94 percent of checks involved checking both read and write signatures, whereas the rest were write checks only (see Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2012.138). This is due to invalidations, replacements, and L2 cache misses, which are frequent events in contended codes with large transactions, exacerbated by LogTM's cacheable logs. Such events need both read and write filters to be checked to ensure isolation in the virtualized TM system [17]. Therefore, we can assume that $p_{cr} = 1/2$.
Several situations of interest have been evaluated in Figs. 3a to 3h. Plots represent (11) considering two variables for the filter in Fig. 2: the asymmetry between the read and write sections (labeled as asymmetric factor in plots), and the fraction of the total filter taken up by the multiset section (labeled as multiset fraction). Both variables range from 0 to 1. The lower part corresponds to a contour plot of the expected false positive probability, whereas a side view of the surface is depicted in the upper part, showing more clearly its minimum values. The cardinality of RS and WS is sketched in a Venn diagram. It is to be noted that Figs. 3a, 3b, and 3c correspond to a symmetrical insertion pattern, where the cardinalities of RS and WS are identical.
When the occupancy is low (Fig. 3a), the symmetrical separate configuration and the full multiset get similar false positive rates. Nevertheless, the situation changes when the filter is more populated (Fig. 3b). In this case, asymmetric separate configurations exhibit better false positive rates. A heuristic occupancy threshold of $s/M \approx 2/3$ can be found, which separates these two scenarios. Fig. 3c shows the effect of having a nonnull $RS \cap WS$, which increases the number of insertions, because there are memory locations that are inserted as both read and written. Thus, higher occupancy is expected for nonmultiset filters and, consequently, more false positives. Fig. 3d illustrates an asymmetric insertion pattern where $Card(RS) > Card(WS)$. In this case, several configurations lead to the minimum false positive probability. Both the full multiset and a separate asymmetric solution can be chosen.
The scenarios of Figs. 3e and 3f, where almost $WS \subset RS$, give advantage to the full multiset configuration versus having RS and WS completely separated. Nevertheless, as $k_{\cup p}$ grows with respect to $k_{\cup s}$, the advantage gets smaller, because addresses that are both read and written assert more bits in the multiset part, as can be observed in Fig. 3f.
Finally, Figs. 3g and 3h show two completely asymmetric situations breaking the preceding assumption that $p_{cr} = 1/2$. In Fig. 3g, both the insertion and checking patterns are biased to reads ($Card(RS) > Card(WS)$ and RS is the most frequently checked). Here, a separate asymmetric configuration gets a slightly lower false positive probability than the full multiset. In Fig. 3h, $Card(RS) > Card(WS)$ as well, but WS is the most frequently checked set. This last situation may happen if the probability of checking the write set is much larger than that of checking the read set. Here, the full multiset configuration has no advantage over the separate one.
The next section evaluates several implementations of the analyzed signatures. Fig. 4 shows the explored solutions in terms of three dimensions: read asymmetry ($m_R$ to $m_W$ ratio), hash sharing ($k_{\cup s}$ fraction of $m_\cup$; $m_{SH}$ in Fig. 4) and multiset ($k_{\cup p}$ fraction of $m_\cup$; $m_{MS}$ in Fig. 4). Our proposed signatures are marked with circles: four ASYM and five MS schemes. The SEP scheme, symmetric and separate, is equivalent to ASYM $a = 4$. Unified (UNI) blind and helper signatures from Choi and Draper [30] are also shown, marked with squares. Note the equivalence between the MS $s = 4$ and UNI blind schemes. The UNI helper signature remains in the shared plane but shifts slightly on the asymmetric axis due to its helper write register. Results (not shown) for UNI helper and MS $s = 3$ signatures are fairly similar.
EXPERIMENTAL EVALUATION

Methodology
Simics [33], a full-system execution-driven simulator, was used to evaluate the signature schemes in Section 3. Simics simulates the Sun Fire server brand and the SPARC architecture, and it is able to run an unmodified copy of the Solaris operating system. Solaris 10 was installed on the simulated machine.
A 16-core CMP system was considered for simulation. Each in-order single-issue core had a 32 KB, four-way, 64B block private L1 I and D cache. L2 cache was unified and shared, with a capacity of 8 MB organized in 16 banks, 8 ways and 64B blocks. Cache coherence was based on the MESI protocol with an on-chip directory holding a bit vector of sharers per block. Main memory was 4 GB.
As regards the TM system, the GEMS module [22] was used, which is provided by the Wisconsin Multifacet Project as an open-source module for Simics. GEMS's Ruby module implements the LogTM-SE HTM [17] and also includes a detailed timing model for the memory system. Ruby was modified to include all proposed signature schemes described in Section 3.
Perfect signatures were used as a reference, because they do not yield false positives. Filter size ranged from 64 bits, which matched the word length in SPARC architecture, to 8K bits length, which matched the performance of perfect signatures for the simulated benchmarks. All filters used four hash functions of the H3 family and the same H3 matrices of Ruby.
The proposed signature schemes were experimentally evaluated using all the codes of the STAMP suite [23]. Such a suite is oriented to the evaluation of TM systems. It covers a comprehensive set of codes including long-running transactions and large read and write sets. For signature evaluation, these codes are of special interest as signatures are stressed. STAMP benchmarks were adapted to GEMS by applying Luke Yen's patches [31]. Table 1 shows the input parameters and main transactional characteristics of the benchmarks. "#xact" is the number of committed transactions and "Time percent" is the percentage of transactional execution cycles. The last columns show the average and the maximum values of RS and WS size distributions in cache blocks as well as the ratio.
Fig. 3. Expected value of the false positive probability according to (11). Lower part corresponds to a contour plot of the surface. Upper part depicts a side view.
Fig. 4. Explored solutions.
Finally, Ruby added pseudorandom delays to memory accesses to deal with variability in simulation experiments. Therefore, multiple runs of each experiment were done to obtain confident error bars [34].
Regular and Parallel Multiset Signatures
1. SSCA2. This benchmark was not signature dependent because of its small transactions, the smallest of the whole suite as Table 1 shows. In addition, it spent most of its time outside transactions.
2. Bayes, Genome, Intruder, Vacation, and Yada. These five workloads behaved better when using MS signatures instead of SEP ones. Bayes and Yada got a slight improvement in execution time for certain signature sizes, about 1.2× for Bayes with parallel small signatures and 1.2× for Yada with regular large ones. These benchmarks showed large transactions that introduced cross false positives. Cross false positives appear in MS signatures as the filter fills. For example, a transaction inserting only reads in the signature could yield a cross false positive, because of filter occupancy, if a test for a write hits the signature. Fig. 6 shows the average false positive percentage for regular SEP and MS signatures. High cross false positive percentages can be observed for Bayes, Genome, Intruder, Vacation, and Yada, but the overall false positive percentage is lower than that for SEP signatures. Note that MS signatures equalize the number of read set and write set false positives. On the other hand, Genome, Intruder, and Vacation performed better using MS signatures. Genome was 1.4× faster with 1 Kbit and 2 Kbit filters. Intruder also exhibited about 1.4× speedup from 256 bit filters downwards, and up to 2.5× speedup was achieved for Vacation.
3. Kmeans and Labyrinth. MS signatures did not work properly with these workloads. Regular MS signatures performed like regular SEP ones for Kmeans, but parallel MS ones performed worse for some filter sizes. For Labyrinth, MS signatures performed much worse than SEP filters, especially the parallel ones. Labyrinth's transactions are large on average and fill the filter beyond the 2/3 threshold (see Section 4), introducing many cross false positives. Fig. 6 shows that, in this case, cross false positives translate into a higher overall false positive percentage than that for SEP signatures. The next sections propose certain configuration enhancements that ameliorate filter occupancy.
Parallel signatures performed similarly to regular ones in most cases, as shown in Fig. 5, and required much less area (see Section 5.6). Therefore, subsequent optimizations were made over the parallel scheme.
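The cross false positive mechanism described above can be illustrated with a toy model. The following is only a minimal sketch, not the evaluated hardware: the class name, the 16-bit filter size and the multiplicative hash family are assumptions chosen so that the behavior is easy to follow, whereas the paper's signatures use H3 hash functions.

```python
class MultisetSignature:
    """Toy multiset (MS) signature: one bit array shared by reads and writes."""

    def __init__(self, bits=16, k=2):
        self.bits = bits          # filter size, assumed to be a power of two
        self.k = k                # hash functions per set
        self.array = [False] * bits

    def _indexes(self, address, is_write):
        # Hypothetical hash family (NOT the H3 functions of the paper):
        # reads use functions 0..k-1 and writes use functions k..2k-1.
        # Odd multipliers make each function a permutation of the indexes.
        first = self.k if is_write else 0
        return [(address * (2 * (first + i) + 3) + first + i) % self.bits
                for i in range(self.k)]

    def insert(self, address, is_write):
        for index in self._indexes(address, is_write):
            self.array[index] = True

    def test(self, address, is_write):
        return all(self.array[i] for i in self._indexes(address, is_write))

    def occupancy(self):
        return sum(self.array) / self.bits


sig = MultisetSignature()
sig.insert(5, is_write=False)
read_hit = sig.test(5, is_write=False)          # inserted read is found
write_hit_before = sig.test(5, is_write=True)   # write test still misses

# A read-only transaction with a large read set fills the shared filter.
for address in range(16):
    sig.insert(address, is_write=False)
cross_false_positive = sig.test(12345, is_write=True)
```

In this sketch no write is ever inserted, yet once the read set has filled the array a write test for an unrelated address succeeds, which is the cross false positive effect attributed to filter occupancy.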
Reconfigurable Asymmetric Signatures
ASYM signatures, as seen in Section 3.4, can be configured to have different read set and write set sizes. The reconfiguration could be performed dynamically at runtime. However, in this case some static per-execution configurations were tested, as dynamic reconfiguration would need feedback from different parts of the TM system, runtime, or compiler, which is out of the scope of this paper. Therefore, three asymmetric configurations are shown in Fig. 7: a = 5, which means that the read set comprises 5 SRAMs and the write set 3 SRAMs; a = 6, which devotes 6 SRAMs to the RS and 2 to the WS; and a = 7, with 7 RS SRAMs and just 1 WS SRAM. Configurations with larger WS than RS were not taken into account, because the benchmark transactional characteristics in Table 1 showed that the average RS size of transactions is always larger than or similar to the average WS size for the tested codes.
As shown in Fig. 7, the best asymmetric configuration for each benchmark behaves worse than or similar to the multiset signatures, except for the benchmarks which already behaved badly with multiset signatures, i.e., Kmeans and Labyrinth.
Note that the best ASYM configuration could be chosen by studying the average ratio of the data sets of the benchmarks. The last column of Table 1 shows the ratio between the average RS size and the average WS size of transactions. Bayes had a data set ratio of 1.88 and it behaved better with a = 5, as a = 5 involved a 1.66 filter ratio, which was closer to 1.88 than a = 6 with a ratio of 3 and a = 7 with a ratio of 7. Genome exhibited a ratio of 2.88, which almost matched the ratio of 3 for a = 6, with which it performed better than with other ASYM configurations. The Kmeans ratio was 2.06, which was closest to a = 5, and that configuration yielded the best results. For Labyrinth, the parallel SEP version showed the best results, as it actually was an ASYM signature with a = 4 and a ratio of 1. As seen in Table 1, Labyrinth had a 1.22 RS to WS ratio, so it was closer to a = 4 than to a = 5. Yada showed a ratio of 1.63. This would have led to choosing a = 5 as the configuration value, and results showed that it was the best choice in this case. However, Intruder and Vacation failed to perform best with the configuration parameter suggested by their RS to WS ratio. Intruder showed a 7.64 ratio, closest to a = 7, but in this case the best asymmetric configuration was a = 6. Likewise, the Vacation ratio was 5.47, but the best configuration was a = 6 as well.
Therefore, the RS to WS ratio is a heuristic to choose the configuration for ASYM signatures, but more feedback from other parameters is needed to assure best results. As ASYM signatures are not a general solution if we lack effective mechanisms to dynamically reconfigure them on a per-transaction basis, they were set aside to gain more insight into MS signatures in the next sections.
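The ratio heuristic discussed above can be written down in a few lines. This is only a sketch under stated assumptions: the function names are hypothetical, the candidate set includes a = 4 (the parallel SEP case) alongside the three ASYM configurations, and the benchmark ratios are the ones quoted in the text from Table 1.

```python
TOTAL_SRAMS = 8  # parallel SEP and ASYM signatures comprise eight SRAMs


def filter_ratio(a, total=TOTAL_SRAMS):
    """RS-to-WS filter size ratio when `a` SRAMs are devoted to the read set."""
    return a / (total - a)


def suggest_a(rs_ws_ratio, candidates=(4, 5, 6, 7)):
    """Pick the configuration whose filter ratio is closest to the data set ratio."""
    return min(candidates, key=lambda a: abs(filter_ratio(a) - rs_ws_ratio))


# Average RS/WS ratios quoted in the text (last column of Table 1).
RATIOS = {
    "Bayes": 1.88, "Genome": 2.88, "Kmeans": 2.06, "Labyrinth": 1.22,
    "Yada": 1.63, "Intruder": 7.64, "Vacation": 5.47,
}

SUGGESTED = {name: suggest_a(ratio) for name, ratio in RATIOS.items()}
```

With these inputs the sketch suggests a = 7 for both Intruder and Vacation, whereas the measured best configuration for them was a = 6, which is why the ratio alone is treated as a heuristic rather than a rule.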
Parallel Multiset Shared Signatures
Parallel MS shared signatures were described in Section 3.3.
The motivation behind such a signature came from Fig. 8, which shows the percentage of addresses that were either exclusively read, exclusively written, or both read and written inside transactions for each benchmark. For example, Bayes and Kmeans exhibit close to 100 percent of written addresses that were also read. Overall, about 30 percent of the total memory locations accessed by each benchmark were both read and written. As the percentage of addresses both read and written inside transactions was significant, the next step was to figure out the number of filters that could be implemented as single-ported SRAMs in MS signatures without losing performance. For that purpose, experiments were conducted with parameter s ranging from 0 (equivalent to parallel MS signatures) to 4 functions (every insertion involves both read and write sets). Fig. 9 shows the execution time for parallel MS shared signatures. As read and write set hash functions were shared, the results improved for all the benchmarks. In fact, MS s = 4 got the best results for every workload except for Bayes and Genome, whose execution slowed down about 1.25× with respect to parallel filters for 8 Kbit signatures. Therefore, MS s = 3 signatures were conservatively chosen instead of s = 4 for the sake of generality. In the Appendix, which is available in the online supplemental material, we show more results to gain insight into read-read dependencies due to shared signature schemes.
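The effect of the sharing parameter s can be sketched with a small model of a parallel signature with four banks. This is an illustrative assumption-laden sketch, not the simulated design: the class name, the bank size, the choice of which banks are shared (the first s) and the toy index function are all hypothetical.

```python
class ParallelSharedMS:
    """Toy parallel MS signature in which `s` of the `k` banks share one hash
    function between reads and writes (hypothetical model)."""

    def __init__(self, s, k=4, bank_bits=64):
        if not 0 <= s <= k:
            raise ValueError("s must be between 0 and k")
        self.s, self.k, self.bank_bits = s, k, bank_bits
        self.banks = [[False] * bank_bits for _ in range(k)]

    def _index(self, bank, address, is_write):
        # Toy hash, not H3. Shared banks ignore the access type; unshared
        # banks offset the index so reads and writes assert different bits.
        base = address * (2 * bank + 3)
        if bank < self.s:
            return base % self.bank_bits
        return (base + (2 if is_write else 1)) % self.bank_bits

    def insert(self, address, is_write):
        for bank in range(self.k):
            self.banks[bank][self._index(bank, address, is_write)] = True

    def test(self, address, is_write):
        return all(self.banks[bank][self._index(bank, address, is_write)]
                   for bank in range(self.k))


def read_only_address_tests_as_written(s):
    """Insert one read and report whether a write test for it succeeds."""
    sig = ParallelSharedMS(s)
    sig.insert(100, is_write=False)
    return sig.test(100, is_write=True)
```

In this model, `read_only_address_tests_as_written(4)` is true because every insertion involves both sets, which corresponds to the read-read dependencies mentioned above, while with s = 3 the single unshared bank still discriminates reads from writes.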
Enhancing Multiset Signatures with Locality-Sensitive Hashing
In this section, locality-sensitive hashing [31] was used to enhance parallel MS s = 3 signatures. Locality-sensitive hashing takes advantage of locality of reference, which applications usually exhibit to a greater or lesser extent, to store a set of addresses more concisely. In a Bloom filter with locality-sensitive hash functions, nearby locations assert nondisjoint bits in the bit array, saving occupancy. Locality hash functions operate as follows: an address maps into k different indexes through k binary n × m matrices (n = address length and m = index length), which have null rows depending on the locality granularity. Then, only one hash index is different between two contiguous addresses, as one matrix has no null rows and the rest have null rows for the less significant bits of the address. Addresses with distance two have two different indexes, i.e., two matrices have no null rows for the differing bit. Addresses with distance greater than 2^(k-1) - 1 may have no hashing outputs in common, as none of the matrices has null rows for the address bits involved. That is, one hash operates as a generic H3 hash, but the others take advantage of locality with different granularity.
L1. [...] Fig. 1f) and asserted different bits in the same filter, some bits for the read set and some others for the write set. In the first locality scheme (L1), h_3^r and h_3^w took advantage of locality with maximum granularity, i.e., their corresponding matrices had three null rows. h_2 and h_1 progressively reduced the granularity and h_0 behaved as a generic H3 function, i.e., h_0 had no null rows. This way, the separate functions that discriminate read set locations (h_3^r) from write set locations (h_3^w) asserted fewer bits in their filter, reducing the false positive rate, but failed to discriminate locations read from nearby locations written.
L2. In this case, h_0 was the function which took advantage of locality with maximum granularity, while h_3^r and h_3^w behaved as generic H3 functions (no null rows in their matrices), i.e., the filter which did not share the hash functions stayed the same as in the s = 3 configuration, thus discriminating between locations read and written, and the other filters got the locality improvement.
As Fig. 10 shows, results for the L1 scheme were practically the same as those for L2 for every benchmark except for Labyrinth, Genome, and Yada. Labyrinth behaved better with L2 for small signatures, but Genome and Yada got slightly worse results. MS shared locality signatures outperformed parallel and locality SEP ones in most cases. Speedups over parallel SEP signatures ranged from 15 percent for MS s = 3 up to 47 percent for MS s = 3 L2 on average.
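The null-row mechanism of locality-sensitive H3 hashing can be sketched as follows. This is a minimal illustration of the general description, not of the specific L1 or L2 assignments: the random matrices, the seed, the 10-bit index length and the rule that hash i has i null rows are assumptions, while the 26-bit address length and k = 4 follow the values mentioned in the text. Each matrix is stored as one m-bit row per address bit, and the index is the XOR of the rows selected by the address bits that are set, which is equivalent to the usual H3 XOR trees.

```python
import random

ADDR_BITS = 26   # n: address length
INDEX_BITS = 10  # m: index length (illustrative value)
K = 4            # number of hash functions


def make_h3_matrix(rng, null_rows):
    """Binary n x m matrix as a list of m-bit rows; the rows of the
    `null_rows` least significant address bits are zero."""
    rows = []
    for bit in range(ADDR_BITS):
        if bit < null_rows:
            rows.append(0)  # this address bit does not affect the index
        else:
            rows.append(rng.randrange(1, 1 << INDEX_BITS))  # non-null row
    return rows


def h3_hash(matrix, address):
    """XOR together the rows selected by the address bits that are set."""
    index = 0
    for bit, row in enumerate(matrix):
        if (address >> bit) & 1:
            index ^= row
    return index


def make_locality_hashes(seed=0):
    """Hash i has i null rows: hash 0 is a generic H3 function."""
    rng = random.Random(seed)
    return [make_h3_matrix(rng, null_rows=i) for i in range(K)]


def shared_indexes(matrices, a, b):
    """Number of hash functions that map both addresses to the same index."""
    return sum(h3_hash(m, a) == h3_hash(m, b) for m in matrices)


MATRICES = make_locality_hashes()
```

Because H3 is linear, two addresses that differ only in bit j get different indexes exactly from the hashes whose row j is non-null, so contiguous addresses share three of the four indexes and addresses at distance eight share none in this sketch.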
Hardware Implementation
This section deals with area, time, and energy requirements for the proposed signatures. Table 2 shows the results gathered for different filter sizes. The "Filter size" row stands for the size of one set filter, i.e., 4 Kbit means two filters of 4 Kbit (for RS and WS) for SEP signatures and one filter of 8 Kbit for MS ones. CACTI 6.5 [24], [35] was used to model the SRAMs using the 65 nm process. Due to limitations of CACTI, SRAMs under 1 Kbit were not considered. All SRAMs had separate read/write ports, which were dual ended, meaning that two lines were required per bitline. Output/input bus width was set to one and CACTI's optimization function searched for the best partition of the cell array depending on time, power, and area efficiency.
Concerning SRAMs, regular SEP signatures had two 4-ported SRAMs, one for the RS and one for the WS (see Section 3). Regular MS signatures had one 8-ported SRAM and they were three times as large as SEP ones due to port increase. Parallel SEP and ASYM signatures comprised eight single-ported SRAMs, so they were more concise than regular ones, specifically, 6.5 times on average. Parallel MS signatures had four double-ported SRAMs and they were more than twice as large as parallel SEP signatures, because of the double-ported SRAMs. Parallel MS s = 3 signatures had three single-ported SRAMs and only one double-ported SRAM, lowering the area up to 1.2 times the area of parallel SEP signatures. Finally, an alternative hardware implementation to parallel MS s = 3 was introduced. It was called parallel s = 3 and comprised three single-ported SRAMs and two single-ported SRAMs of half size instead of one double-ported SRAM. This scheme yielded the same execution time results (not shown) as parallel MS s = 3 but less hardware was needed in its implementation, because all SRAMs were single-ported. In fact, parallel s = 3 was 10 percent less hardware-consuming than parallel SEP, because large single-ported SRAMs provided better area efficiency results as they multiplexed the bitlines to share sense amplifiers. SRAM time results showed that MS schemes were about 12 percent slower than SEP designs, because of double-ported SRAMs. However, the parallel s = 3 signature was about 5 percent slower, because of the large single-ported SRAMs. Energy values showed that an increment in area for MS signatures translated into an increment in energy consumption, but such an area increment came from bitlines and wordlines whose energy consumption was less significant than other components in the SRAMs. Thus, parallel MS s = 3 signatures were 6 percent more energy-consuming than parallel SEP ones, while the parallel s = 3 scheme saved up to 19 percent of energy with respect to parallel SEP.
As regards hash functions, combinational logic modules were implemented in Synopsys Design Compiler using the H3 matrices of the simulation. These matrices showed 13 bits per column on average (half of 26 address bits) and they had many bit pairs in common. Synopsys Design Compiler optimized the XOR gate trees, finding these bit pairs to use as few gates as possible. Table 3 shows the results of these optimizations compared to the upper-bound values estimated in Section 3.5, multiplied by 3.6 µm², which was the area of 2 fan-in XOR gates in Synopsys. Also, the optimized version used 3 and 4 fan-in gates. Comparing H3 hash and SRAM area in Table 2, hash logic was about one twentieth of the SRAM area for parallel SEP and s = 3 2 Kbit signatures. Note that the hash logic grows linearly with the filter size, so the fraction of H3 hash logic with respect to SRAMs becomes more negligible as filter size grows. Locality-sensitive hashing (not shown) uses about 12.5 bits per column of hash matrices on average, thus reducing area up to 3 percent with respect to the generic H3 implementation. We show in Table 2 the area required by a PBX implementation of hash functions as a hardware cost lower bound for H3-type signatures. PBX was about four times as concise as H3 for parallel SEP and s = 3 signatures, because only an XOR gate per index bit was needed instead of a whole XOR tree of height four. Table 2 shows a fourfold increment in time and energy for the H3 implementation of hash functions compared to PBX. PBX exhibited the best implementation figures and performed similarly to H3 [28]; however, PBX lacked generality, because a previous study of the entropy of workloads must be carried out to enable better signature performance. Last but not least, H3 delay could be hidden by pipelining the index generation and the access to the SRAMs.
CONCLUSIONS
Signatures used in transactional systems are commonly implemented as two separate same-sized Bloom filters, one for reads and another for writes. This work proposes multiset and reconfigurable asymmetric signatures as alternative schemes to tackle the uneven cardinality in read and write sets that many workloads exhibit. Multiset signatures use only one Bloom filter to track both read and write sets, while asymmetric signatures use Bloom filters of different configurable size for reads and writes. This paper presents a thorough study of the proposed signature designs. In addition, multiset signatures are enhanced using locality-sensitive hashing previously proposed by the authors.
Multiset and asymmetric signatures were implemented in the Wisconsin GEMS simulator. CACTI and Synopsys Design Compiler were used to evaluate hardware requirements. Experimental results show that the best signature configuration improves the execution performance up to 47 percent on average, without increasing significantly the hardware complexity.
