The scalability of cache coherence protocols is a significant challenge in multicore and other distributed shared memory systems. Traditional snoopy and directory-based coherence protocols are difficult to scale up to many-core systems because of the overhead of broadcasting and storing sharers for each cacheline. Tardis, a recently proposed coherence protocol, shows potential in solving the scalability problem, since it only requires O(log N) storage per cacheline for an N-core system and needs no broadcasting support.
INTRODUCTION
As the number of cores on a single chip increases geometrically following Moore's Law, the cache coherence protocol becomes a potential scalability and performance bottleneck. Snoopy coherence protocols work well for small-scale systems with a few cores, but do not scale due to the traffic pressure caused by broadcasting messages on the bus. Directory-based coherence protocols [1, 2] have better scalability and are widely used in multicore processors today [3, 4]. For future systems with hundreds or even thousands of cores, however, the storage overhead of a full-map directory becomes a serious scalability bottleneck. A number of enhancements to directory coherence protocols have been proposed in the literature [5, 6, 7, 8] to improve scalability, but they typically sacrifice performance and incur extra implementation and verification complexity.
As an alternative, a recently proposed cache coherence protocol called Tardis shows better scalability while maintaining simplicity and high performance [9]. The key insight behind Tardis is that enforcing a consistency model in physical time order, which is what traditional coherence protocols do, is sufficient but not necessary. Instead, Tardis enforces the consistency model using logical time. As a result, a memory write no longer needs to invalidate shared copies. The write can simply create a new version at a later logical time, which coexists with the previous versions at the same physical time. The consistency model is not violated since the new version is properly ordered after the previous version in logical time order.
Introducing logical time into the coherence protocol brings several salient properties to Tardis. First, Tardis is scalable; it has no sharer list and only requires O(log N) storage per cacheline for an N-core system. It also does not require broadcasting support. Tardis is also simple; it uses timestamps to explicitly express the consistency model and is easy to reason about and verify [10]. Different from other coherence protocols, Tardis does not need invalidation and is therefore able to achieve better performance on some benchmarks.
There are three drawbacks of the original Tardis protocol. The first drawback is that it only supports the sequential consistency (SC) memory model. While SC is simple and well studied, commercial processors usually implement more relaxed models. Intel x86 [11] and SPARC [12] processors implement Total Store Order (TSO); ARM [13] and IBM Power [14] processors implement weaker consistency models. It would be difficult for commercial processors to adopt Tardis since the target memory models are not supported.
Another drawback of Tardis is the renew message that is used to extend the logical lease of a shared cacheline in a private cache. These messages incur extra latency and bandwidth overhead. They are unique to Tardis and are not required in a traditional coherence protocol. Although some optimizations have been proposed to reduce the renewal overhead [9], it remains a key disadvantage of Tardis and results in additional network traffic.
Finally, Tardis uses a timestamp self-increment strategy to avoid livelock. This strategy has suboptimal performance when threads communicate via spinning.
In this paper, we will address these drawbacks of the original Tardis protocol and make it more practical. Specifically, we will discuss the changes to the cores and the memory subsystem in order to implement TSO on Tardis. A formal proof that our algorithm correctly implements TSO is also given. We propose new optimization techniques to reduce the number of renew messages in Tardis, which improves performance. We also propose a more efficient mechanism to handle livelock.
Our simulations over a wide range of benchmarks indicate that TSO and our optimizations both improve the performance and reduce the network traffic of Tardis. Compared to a full-map directory-based coherence protocol, the optimized Tardis achieves better performance (1% average improvement, up to 6.2%) and lower network traffic (5.6% average reduction, up to 14.8%) at the same time. The optimized and baseline Tardis protocols are more space-efficient than directory protocols and simpler to implement.
BACKGROUND
In this section, we introduce the key concepts behind Tardis (Section 2.1) and the basic Tardis protocol (Section 2.2). We also show the pros and cons of Tardis compared to a directory cache coherence protocol (Section 2.3).
Physiological Time
Traditional cache coherence protocols enforce a consistency model in physical time order. A write to a shared cacheline can be performed only after all the shared copies are invalidated because the write must be ordered after all previous reads with respect to physical time. The invalidation mechanism requires either broadcasting support in the network (as in snoopy coherence protocols) or a sharer list in the directory (as in directory-based coherence protocols) and is therefore not able to scale to large numbers of cores. Tardis avoids invalidation by introducing logical time into the cache coherence protocol. In Tardis, a write does not invalidate shared copies. Instead, a new version of the cacheline is created at a later logical time, at which point all previous shared copies have expired logically. This usage of logical time to order data bears some similarity to the multiversion concept in concurrency control algorithms used in database systems [15].
Specifically, Tardis introduces a new concept of time called physiological time and enforces the consistency model with respect to it. Using $<_{pt}$ and $<_{ts}$ to represent physical time order and logical timestamp order respectively, the physiological time order ($<_{pl}$) is defined by Eq. (1).

$$X <_{pl} Y \;:=\; X <_{ts} Y \ \text{or}\ (X =_{ts} Y \ \text{and}\ X <_{pt} Y) \qquad (1)$$
An operation X is before Y in physiological time order if X has a smaller timestamp or if they have the same timestamp but X happens earlier in physical time. If two operations happen at the same logical and physical time, then we can use any metric to break the ties.
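The ordering described above can be sketched in a few lines of Python. This is only an illustration: the names `PhysTime` and `before_pl` are hypothetical, and physical time is assumed to be an integer such as a cycle count.

```python
from typing import NamedTuple


class PhysTime(NamedTuple):
    """Physiological time of an operation (hypothetical helper type)."""
    ts: int  # logical timestamp
    pt: int  # physical time, assumed here to be an integer cycle count


def before_pl(x: PhysTime, y: PhysTime) -> bool:
    """Return True if x precedes y in physiological time order."""
    return x.ts < y.ts or (x.ts == y.ts and x.pt < y.pt)


# An operation that is later in physical time but has a smaller timestamp
# is still ordered first.
print(before_pl(PhysTime(ts=3, pt=7), PhysTime(ts=5, pt=2)))  # True
# With equal timestamps, physical time breaks the tie.
print(before_pl(PhysTime(ts=5, pt=2), PhysTime(ts=5, pt=9)))  # True
print(before_pl(PhysTime(ts=5, pt=9), PhysTime(ts=5, pt=2)))  # False
```

Because the logical timestamp is compared first, this is the same as a lexicographic comparison of the (ts, pt) pairs.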
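The definition of physiological time makes it easy to reason about the consistency model implemented on Tardis. Sequential consistency (SC), for example, requires the following two invariants, where $<_p$ and $<_m$ are the program order and the global memory order respectively, and L(a) and S(a) are a load and a store to address a respectively.

Invariant 1: $X(a) <_p Y(b) \Rightarrow X(a) <_m Y(b)$, where X and Y are each a load or a store.

Invariant 2: Value of $L(a)$ = Value of $\max_{<_m}\{S(a) \mid S(a) <_m L(a)\}$.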
The execution of a parallel program should be equivalent to a serial execution where all operations from each processor (core) are sequentially executed (Invariant 2), and the sequential order should agree with the program order (Invariant 1).
In Tardis, the global memory order ($<_m$) in the definition of SC is replaced with the physiological time order ($<_{pl}$). Section 2.2 describes the detailed Tardis protocol which enforces both invariants of SC.
Tardis Coherence Protocol
Tardis uses timestamps to represent logical time. Each committed instruction has a commit timestamp. The combination of the commit timestamp and the physical commit time forms the physiological commit time which represents the sequential order of an instruction. For SC, this timestamp should monotonically increase in the program order (Invariant 1 in SC). A program timestamp (pts) is used to represent the commit timestamp of an instruction.
An important concept in Tardis is timestamp leasing of a cacheline. Each shared cacheline is valid within a lease which is a range of timestamps. The start and end timestamps of a lease are represented using write timestamp (wts) and read timestamp (rts) respectively. Both timestamps are stored next to each cacheline. The wts is the timestamp when the data is created, and the rts can be considered as a promise that no write happens between wts and rts. For SC, a read operation can only happen within the lease of a cacheline (wts ≤ pts ≤ rts). For cachelines in Modified or Exclusive state, the end lease is not fixed and may increase if pts is greater than the current rts.
For a shared cacheline in the last level cache (LLC), Tardis maintains the invariant that the cacheline's rts is no less than the maximal rts of its shared copies in private caches. For a read request, the data and both timestamps are returned to the requester. The rts might increase depending on the pts of the requesting instruction, since the returned rts should be no less than pts. For a write request, Tardis does not send invalidations. Instead, exclusive ownership can be immediately returned to the requesting core, which will perform the write at a timestamp greater than the cacheline's current rts. Since all shared copies of the cacheline have rts less than or equal to the rts in the LLC, the new version created by the write is ordered after all the previous versions in logical time, and therefore SC is guaranteed in physiological time order.
If the cacheline is exclusively owned by another core when a read or write request arrives at the LLC, then a request is sent to the owner core, which will then write the latest data back to LLC. This is the same operation as in a traditional coherence protocol.
In a private cache, a cacheline in modified state can always serve read and write requests. The timestamp may be updated to reflect the latest read or write operation. A cacheline in shared state, however, can only be accessed if pts is less than or equal to rts. If pts is greater, then the core does not know whether the current data in the cacheline is still valid at pts or it has been changed by another core at some timestamp between rts and pts. As a result, a renew request must be sent to the LLC to extend the rts of the cacheline. If the rts cannot be extended because a new version of data has been created, then the latest version of the cacheline will be returned.
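The timestamp rules of this section can be sketched as follows. This is a simplified model and not the protocol's actual implementation: the lease length `LEASE`, the `Line` type, and the function names are assumptions made for illustration, and the store is applied directly to the LLC copy instead of modeling the transfer of exclusive ownership to a private cache.

```python
from dataclasses import dataclass

LEASE = 10  # hypothetical static lease length (assumption, not from the text)


@dataclass
class Line:
    data: int
    wts: int = 0  # write timestamp: start of the lease
    rts: int = 0  # read timestamp: end of the lease


def llc_read(line: Line, pts: int) -> Line:
    """Serve a shared read or renew at the LLC and return a snapshot.

    The lease is extended so that the returned rts is no less than pts.
    Extending it to pts + LEASE is an assumption of this sketch.
    """
    line.rts = max(line.rts, pts + LEASE)
    return Line(line.data, line.wts, line.rts)


def store(line: Line, pts: int, value: int) -> int:
    """Write at a timestamp greater than the current rts; return the new pts."""
    new_pts = max(pts, line.rts + 1)
    line.data, line.wts, line.rts = value, new_pts, new_pts
    return new_pts


def l1_shared_hit(snapshot: Line, pts: int) -> bool:
    """A shared private copy is usable only while pts <= rts."""
    return pts <= snapshot.rts


llc = Line(data=0)
snapshot = llc_read(llc, pts=0)
print(snapshot.rts)                       # 10
new_pts = store(llc, pts=2, value=1)      # no invalidation is sent
print(new_pts, llc.wts)                   # 11 11
print(l1_shared_hit(snapshot, 10), l1_shared_hit(snapshot, 11))  # True False
```

The old snapshot stays readable up to timestamp 10, so every read of it is ordered before the store at timestamp 11. A reader whose pts moves past 10 misses and must renew.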
In Tardis, if a core keeps reading a stale cacheline at a pts smaller than the cacheline's rts, then the core may never observe a later update and livelock may occur. To guarantee that a write is eventually observed by other cores, Tardis forces the pts of each core to periodically increase so that a stale version of a cacheline will eventually expire. More discussion on livelock will be presented in Section 4.1.
Tardis vs. Directory-based Coherence
A key advantage of Tardis compared to a traditional physical time based coherence protocol is the removal of the invalidation mechanism. This simplifies the protocol design, reduces the storage overhead, and achieves potentially better performance.
Removing the invalidation mechanism means that the LLC does not notify the sharers when a cacheline is modified. So each core should contact the LLC to learn the freshness of a cacheline. This is done through renew messages (cf. Section 2.2). In a sense, replacing invalidations with renewals is replacing a push-based model with a pull-based model. A pull-based model is usually simpler and more scalable, but may lead to wasted messages. For example, read only data may be constantly renewed due to expiration, but these messages would not exist in an invalidation-based protocol. The latency and bandwidth overhead of the extra renew traffic is the major disadvantage of Tardis compared to directory coherence protocols.
Some optimization techniques (e.g., speculative read) have been proposed to hide the latency of renewals [9]. However, the performance and network traffic overhead were still significant for some benchmarks. We will present our solutions to the renewal problem in Section 4.
TSO ON TARDIS
The original Tardis protocol only supports the sequential consistency (SC) memory model, but commercial processors usually implement more relaxed consistency models for better performance. In this section, we show how Tardis can be generalized to relaxed consistency models. We first use Total Store Order (TSO) as a case study since it has a precise definition and is widely adopted. We then generalize the discussion to other memory models.
Formal Definition of TSO
A consistency model can be specified using either an operational or an axiomatic model [11]. The two models are equivalent to each other. In this paper, we use the axiomatic model to define TSO since it makes it easier to reason about physiological time.
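Similar to SC, TSO also requires a global order (specified using $<_m$) of all memory instructions. However, the global memory order only needs to agree with the program order for certain memory dependencies. Specifically, TSO can be defined using the following three invariants [16]. Relative to SC, Invariant 1 drops the Store → Load case, Invariant 2 also considers stores that precede the load in program order, and Invariant 3 is new.

Invariant 1: $L(a) <_p L(b) \Rightarrow L(a) <_m L(b)$; $L(a) <_p S(b) \Rightarrow L(a) <_m S(b)$; $S(a) <_p S(b) \Rightarrow S(a) <_m S(b)$.

Invariant 2: Value of $L(a)$ = Value of $\max_{<_m}\{S(a) \mid S(a) <_m L(a) \ \text{or}\ S(a) <_p L(a)\}$.

Invariant 3: $X <_p \text{FENCE} \Rightarrow X <_m \text{FENCE}$; $\text{FENCE} <_p X \Rightarrow \text{FENCE} <_m X$.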
In TSO, the program order implies the global memory order only for Load → Load, Load → Store and Store → Store dependencies. For Store → Load dependency, the load can bypass the pending store requests and commit earlier (Invariant 1). Although the load is after the store in the program order, it is before the store in the global memory order.
Traditional processors implement TSO using a store buffer which is a FIFO storing the pending store requests. If a load address exists in the store buffer, then the value is directly returned; otherwise, the load accesses the cache hierarchy and may commit before previous stores finish (Invariant 2). Since global memory order is the physical commit time order in traditional processors, this bypassing behavior allows a load later in program order than a store to commit earlier in global memory order.
TSO uses a fence instruction when a Store → Load order needs to be enforced (Invariant 3). In a traditional processor, a fence flushes the store buffer enforcing that all previous stores have committed so that a later load will be ordered after stores before the fence in physical time and therefore the global memory order. If all memory operations are also fences, then TSO becomes SC.
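As a rough illustration of the store buffer behavior described above, the following toy model forwards loads from pending stores, lets other loads bypass them, and drains the buffer on a fence. The class `StoreBufferCore` and the dictionary standing in for the cache hierarchy are assumptions of this sketch and do not model any particular processor.

```python
from collections import deque


class StoreBufferCore:
    """Toy core with a FIFO store buffer (hypothetical, for illustration)."""

    def __init__(self, memory: dict):
        self.memory = memory         # shared memory: address -> value
        self.store_buffer = deque()  # pending (address, value), oldest first

    def store(self, addr, value):
        # The store is buffered and not yet visible to other cores.
        self.store_buffer.append((addr, value))

    def load(self, addr):
        # Forward from the youngest pending store to the same address.
        for pending_addr, value in reversed(self.store_buffer):
            if pending_addr == addr:
                return value
        # Otherwise the load bypasses the pending stores.
        return self.memory.get(addr, 0)

    def fence(self):
        # Drain in FIFO order so that all earlier stores become visible.
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value


memory = {}
c0, c1 = StoreBufferCore(memory), StoreBufferCore(memory)
c0.store("A", 1)
c1.store("B", 1)
r0 = c0.load("B")  # bypasses c0's pending store to A
r1 = c1.load("A")  # bypasses c1's pending store to B
print(r0, r1)      # 0 0
c0.fence()
c1.fence()
print(c0.load("B"), c1.load("A"))  # 1 1
```

The outcome r0 = r1 = 0 is allowed under TSO but cannot occur under SC, which is the Store → Load relaxation. Placing a fence between each store and the following load would rule it out.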
TSO Implementation on Tardis
In this section, we describe how TSO can be implemented on Tardis. Specifically, we discuss the changes to timestamp management policy compared to the baseline Tardis protocol for SC.
Program Timestamp Management
The original Tardis protocol implements SC and uses a single program timestamp (pts) to represent the commit timestamp of an instruction. Since the program order always agrees with the global memory order, pts monotonically increases in the program order.
In TSO, however, the program order does not always agree with the global memory order. Following Invariant 1 in TSO's definition, a store's timestamp is no less than the timestamps of all previous loads, stores and fences in the program order. A load's timestamp is no less than the timestamps of previous loads and fences, but not necessarily previous stores. As a result, it is difficult to use a single monotonically increasing pts to represent the ordering constraint. To express the different constraints for loads and stores respectively, we split the original pts into two timestamps. The store timestamp (sts) represents the commit timestamp for a store, and the load timestamp (lts) represents the commit timestamp for a load. According to Invariant 1, both sts and lts should monotonically increase in the program order. Furthermore, the sts of a store should be no less than the lts of the last load. For a load, however, its lts is independent of the last sts.
A fence can be simply implemented as a synchronization point between sts and lts. The smaller one of the two timestamps should increase to be the value of the bigger one. Therefore, operations after the fence are ordered after operations before the fence in physiological time order (and therefore the global memory order). Thus, Invariant 3 is maintained. If each memory operation is also a fence, then the commit timestamp for each operation monotonically increases and the protocol becomes Tardis SC. Note that the lts in TSO is usually smaller than the pts in SC.
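A minimal sketch of these rules is given below. The class name `TsoTimestamps` and the argument `required_ts` (the smallest timestamp the accessed cacheline allows for the operation) are hypothetical names introduced for illustration.

```python
class TsoTimestamps:
    """Per-core load and store timestamps (illustrative sketch)."""

    def __init__(self):
        self.lts = 0  # commit timestamp of the last load
        self.sts = 0  # commit timestamp of the last store

    def commit_load(self, required_ts: int) -> int:
        # lts never decreases and is independent of sts.
        self.lts = max(self.lts, required_ts)
        return self.lts

    def commit_store(self, required_ts: int) -> int:
        # sts never decreases and is no less than the lts of the last load.
        self.sts = max(self.sts, self.lts, required_ts)
        return self.sts

    def fence(self) -> int:
        # The smaller timestamp is raised to the value of the larger one.
        self.lts = self.sts = max(self.lts, self.sts)
        return self.lts


t = TsoTimestamps()
t.commit_store(11)
print(t.lts, t.sts)  # 0 11
t.commit_load(3)
print(t.lts, t.sts)  # 3 11
t.fence()
print(t.lts, t.sts)  # 11 11
```

Calling `fence()` after every memory operation keeps lts and sts equal, which reduces the pair to the single monotonically increasing pts of Tardis SC.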
Data Timestamp Management
To switch from SC to TSO, the logic in the LLC is not changed. But the timestamp management in the private L1 should be changed slightly for load requests.
To load a cacheline that is not dirty (i.e., the data has not been written by the caching core since it is cached), the timestamp rule is the same as in SC; the lts should fall within the lease of the cacheline (wts ≤ lts ≤ rts). And if lts is less than wts, then lts should jump ahead in logical time in order to perform the load.
For a dirty cacheline in the store buffer or in the L1 cache, however, the lts does not have to be greater than the wts of the cacheline. With respect to the global memory order, this means that the load can return data at an lts that is before the creation of the data which is at wts. This behavior certainly violates SC but it is completely legal in TSO.
According to Invariant 2 of TSO, a load should return either the last store in global memory order or the last store in program order, depending on which one has a larger physiological time. Since a dirty cacheline must be written by a store from the current core prior to the load in program order, even if the load has a smaller timestamp than the store, Invariant 2 is still maintained. A more formal proof of correctness will be presented in Section 3.4.
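The load rule for the private L1 can be sketched as a single function. This is an illustration under simplifying assumptions: `L1Line` and `load_timestamp` are hypothetical names, a clean line is assumed to be in shared state, and a return value of `None` stands for a lease that has expired and needs a renew from the LLC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class L1Line:
    wts: int
    rts: int
    dirty: bool = False  # written by this core since the line was cached


def load_timestamp(lts: int, line: L1Line) -> Optional[int]:
    """Return the lts after the load, or None if a renew is needed."""
    if line.dirty:
        # A dirty line may be read at an lts smaller than its wts.
        return lts
    # A clean line is read within its lease: wts <= lts <= rts.
    new_lts = max(lts, line.wts)  # jump ahead to the start of the lease
    if new_lts > line.rts:
        return None               # lease expired
    return new_lts


print(load_timestamp(0, L1Line(wts=11, rts=11, dirty=True)))  # 0
print(load_timestamp(0, L1Line(wts=4, rts=9)))                # 4
print(load_timestamp(12, L1Line(wts=4, rts=9)))               # None
```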
In a traditional coherence protocol, the main advantage of TSO over SC is the performance gain due to loads bypassing stores in the store buffer. In Tardis, besides the bypassing, TSO can also reduce the number of renewals compared to SC. The lts in TSO may increase slower compared to the pts in SC. As a result, fewer shared cachelines expire. Note that the renewal reduction benefit does not require the existence of a store buffer. In fact, TSO can be implemented on Tardis even on in-order cores that do not have a store buffer, because Tardis can figure out the correct memory ordering using logical timestamps.
TSO Examples
Listing 1: Example Program
[ core0 ]
[ core1 ]
We use the example program in Listing 1 to demonstrate how timestamps are managed in Tardis TSO. The execution of the program is shown in Fig. 1. For simplicity, we do not model the store buffer and execute one instruction from each core at each step.
Initially, both addresses A and B are cached in Shared (S) state in both cores' private cache as well as in the shared LLC. wts of all lines are 0; rts of all lines of address A are 5 and rts of all lines of address B are 10.
In step 1, core 0 writes to address B and core 1 writes to address A. Exclusive ownership of A and B is given to core 1 and core 0, respectively, and both stores can be performed by jumping ahead in logical time to the end of the lease. After the stores, core 0's sts jumps to timestamp 11 and core 1's sts jumps to 6, but the lts of both cores remains 0.
In step 2, core 0 loads address B. The value of the previous store from core 0 is returned (r1 = 1). Since B is in dirty state, the load can happen without checking the timestamps and lts does not need to increase (Section 3.2.2). In core 1, a fence instruction is executed which synchronizes the lts and sts to timestamp 6.
In step 3, core 0 loads address A. Since its lts is 0, which falls between the wts and rts of cacheline A, this is a cache hit and value 0 is returned (r2 = 0). In core 1, the load to address B also hits the L1 since its lts = 6 falls within B's lease. As a result, the loaded value is also 0 (r3 = 0).
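The three steps can be replayed with a small script that applies the timestamp rules of Section 3.2. This is a simplified sketch: each core's private copies are modeled independently, the LLC and ownership transfer are not modeled, and the value written to A by core 1 is an assumption since it does not affect the timestamps.

```python
def new_core():
    """Initial state: lts = sts = 0, A and B cached in shared state."""
    return {"lts": 0, "sts": 0,
            "lines": {"A": {"wts": 0, "rts": 5, "dirty": False, "val": 0},
                      "B": {"wts": 0, "rts": 10, "dirty": False, "val": 0}}}


def store(core, addr, val):
    line = core["lines"][addr]
    # Jump past the end of the current lease; sts is also no less than lts.
    core["sts"] = max(core["sts"], core["lts"], line["rts"] + 1)
    line.update(wts=core["sts"], rts=core["sts"], dirty=True, val=val)


def load(core, addr):
    line = core["lines"][addr]
    if not line["dirty"]:
        # Clean line: the load must fall within the lease.
        core["lts"] = max(core["lts"], line["wts"])
        assert core["lts"] <= line["rts"], "lease expired, renew needed"
    return line["val"]


def fence(core):
    core["lts"] = core["sts"] = max(core["lts"], core["sts"])


core0, core1 = new_core(), new_core()
store(core0, "B", 1)   # step 1
store(core1, "A", 1)   # step 1 (stored value assumed)
r1 = load(core0, "B")  # step 2: dirty line, lts unchanged
fence(core1)           # step 2
r2 = load(core0, "A")  # step 3
r3 = load(core1, "B")  # step 3
print(core0["sts"], core1["sts"])  # 11 6
print(core0["lts"], core1["lts"])  # 0 6
print(r1, r2, r3)                  # 1 0 0
```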
Listing 2 shows the physiological commit time for each instruction in Listing 1. It also shows the global memory order using arrows. The physiological time is represented using a logical timestamp and physical time pair (ts, pt), where ts is the commit timestamp and pt is the physical commit time of the instruction. According to the definition, $(ts_1, pt_1) < (ts_2, pt_2)$ if $ts_1 < ts_2$ or ($ts_1 = ts_2$ and $pt_1 < pt_2$).
Listing 2: Physiological commit time and global memory order.
The execution is definitely not sequentially consistent since the program order in core 0 is violated between B = 1 and L(B). But it obeys all the invariants of TSO. Note that the store buffer is not used in the example but TSO can still be implemented. This feature is not available in traditional physical time based coherence protocols.
Proof of Correctness
In this section, we prove that the algorithm in Section 3.2 correctly implements TSO. We first introduce several definitions and invariants (in the form of lemmas) of the Tardis protocol (Section 3.4.1). We then use these lemmas to prove that our algorithm implements TSO (Section 3.4.2).
Invariants of Tardis protocol
The invariants of Tardis introduced in this section are true regardless of the memory model used by the system. Some of the lemmas come from a proof of Tardis SC [10], but we restate them in the language of physiological time.
To simplify the discussion, we first introduce the concept of master and snapshot cacheline.
DEFINITION 1 (MASTER AND SNAPSHOT CACHELINE).
A cacheline is a master cacheline if it is in M state in an L1, or in S state in the LLC. A cacheline is a snapshot cacheline if it is in S state in an L1.
Some facts about master and snapshot cachelines:

Fact 1. For each cacheline (with wts and rts), the data value comes from a previous store that happened at sts = wts. In Tardis, the only way to change the wts of a cacheline is by performing a store to it, and the new wts after the store always equals the sts of the store.

Fact 2. For each address in the system, at most one master cacheline exists but multiple snapshot cachelines may exist. This is similar to the single-writer, multiple-reader (SWMR) invariant in traditional coherence protocols, but shared L1 cachelines can coexist with an exclusive L1 cacheline at the same physical time.

Fact 3. A snapshot cacheline is derived by taking a snapshot of the data and timestamps of a master cacheline. In the protocol, an L1 shared cacheline may be derived from a shared response from the LLC or from downgrading a modified cacheline in an L1. In both cases, it comes from a snapshot of a master cacheline.

For the proof, we use (ts, pt) to represent the physiological time of an operation.

LEMMA 1. For each address, the wts and rts of the master cachelines never decrease.
PROOF. In the basic Tardis protocol, no operation decreases the timestamp of a cacheline. So the wts and rts of a particular master cacheline do not decrease. When a new master cacheline is created, its timestamps are copied from a previous master cacheline through an exclusive response to L1 or a write back response to the LLC. And the timestamps of the new master cacheline are no less than those of the previous master cacheline.
LEMMA 2. For a master cacheline, no store to the address has happened at (wts', wpt') such that (wts, wpt) < (wts', wpt'), where wts and wpt are the commit timestamp and the physical commit time of the store that created the data in the cacheline.
PROOF. If such a store had ever happened, it would have created a version of the master cacheline with write timestamp wts' > wts at an earlier physical time. However, Lemma 1 states that the wts of the master cachelines does not decrease. This contradicts the fact that the wts of the cacheline is currently less than wts'.
LEMMA 3. For a snapshot cacheline at physical time pt, no store has happened at (ts', pt') such that (wts, wpt) < (ts', pt') < (rts, pt), where wts and wpt are the commit timestamp and the physical commit time of the store that created the data in the cacheline.
PROOF. A snapshot cacheline is always copied from a master cacheline. When the snapshot is taken, according to Lemma 2, no store to the address exists at a physiological time after (wts, wpt). Since then, according to Tardis rules, a store can only happen at a logical timestamp greater than the rts of the cacheline. Therefore, any later store has a greater physiological time than (rts, inf). So no store can exist between (wts, wpt) and (rts, pt) < (rts, inf).
Tardis TSO Proof
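We first make the following assumption about a processor implementing TSO.

ASSUMPTION 1. The processor commits instructions in program order, i.e., $X <_p Y \Rightarrow X <_{pt} Y$.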
This assumption does not require extra hardware changes to existing processors. In fact, processors today implementing TSO already follow this assumption.

THEOREM 1. The protocol described in Section 3.2 implements TSO.
PROOF (THEOREM 1). We will prove that each invariant in the definition of TSO is maintained in our protocol.
Proof for Invariant 1. According to Section 3.2.1, both lts and sts monotonically increase and, for a store, its sts is no less than the current lts. In other words,

$L(a) <_p L(b) \Rightarrow L(a) \le_{ts} L(b)$
$L(a) <_p S(b) \Rightarrow L(a) \le_{ts} S(b)$
$S(a) <_p S(b) \Rightarrow S(a) \le_{ts} S(b)$

Combined with our assumption of a TSO processor, writing X and Y for the earlier and later operation in each of these cases, we have:

$X \le_{ts} Y$ and $X <_{pt} Y$ $\Rightarrow$ $X <_{ts} Y$ or ($X =_{ts} Y$ and $X <_{pt} Y$) $\Rightarrow$ $X <_{pl} Y$ $\Rightarrow$ $X <_m Y$

which finishes the proof of Invariant 1 of TSO.

Proof for Invariant 2. We will prove that for each memory load, the returned value is the one specified by Invariant 2 of the TSO definition. According to Fact 1, for each cacheline in Tardis, the data value was created by a store that happened at the cacheline's wts. So a load to a cacheline observes the value of this particular store. Assuming that the store committed at physical time wpt, we need to prove that the physiological time of this store, which is (wts, wpt), is greater than that of all the other stores in the set $S = S_1 \cup S_2$, where $S_1 = \{S(a) \mid S(a) <_m L(a)\}$ and $S_2 = \{S(a) \mid S(a) <_p L(a)\}$, and that the observed store is also in this set. Specifically, we consider the following three cases.
Case 1, load to a shared L1 cacheline. The timestamps of the cacheline are wts and rts. Because wts ≤ lts ≤ rts and wpt < pt, the observed store must be in the set $S_1$. According to Lemma 3, at physical time pt, no store has happened at (wts', wpt') such that (wts, wpt) < (wts', wpt') < (rts, pt). As a result, the observed store has the largest physiological time in set $S_1$.
We now prove that all stores in set $S_2$ have smaller physiological time than the observed store. This can be proven by contradiction. If such a store has been executed by the current core at (wts', wpt') > (wts, wpt), the current core must have owned a master cacheline with wts' after the store. Since the master cacheline's wts never decreases and a snapshot cacheline is copied from a master cacheline, the current shared cacheline must have a wts greater than wts', contradicting the assumption. So the observed store has the largest physiological time in set $S_1 \cup S_2$, proving the invariant.
Case 2, load to a modified L1 cacheline. In this case, the loaded cacheline is a master cacheline. If the cacheline is not dirty, then wts ≤ lts. So the observed store is in $S_1$. If the cacheline is dirty, then the observed store must be before the current load in the program order and is therefore in $S_2$. According to Lemma 2, no store has happened to the address at (wts', wpt') > (wts, wpt). So the observed store must have the largest physiological time in set $S = S_1 \cup S_2$.
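Proof for Invariant 3. In Tardis TSO, a fence synchronizes lts and sts. Using fts and fpt as the commit timestamp and physical commit time of the fence respectively, and $ts_X$ and $pt_X$ as those of an operation X, Tardis TSO enforces the following:

$X <_p \text{FENCE} \Rightarrow ts_X \le fts \ \text{and}\ pt_X < fpt$
$\text{FENCE} <_p X \Rightarrow fts \le ts_X \ \text{and}\ fpt < pt_X$

Combining these relations and applying the definition of physiological time, we prove the last invariant of TSO.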
Other Relaxed Consistency Models
Similar to TSO, other memory consistency models may also be supported in Tardis. Partial Store Order (PSO) [12], for example, relaxes the Store → Store dependency on top of TSO. In Tardis, this means that the sts no longer needs to monotonically increase, and a store can happen as long as its timestamp is no less than the lts of the last load.
Release consistency [17] further relaxes all the memory order constraints and also relaxes the ordering constraints for synchronizations. An acquire guarantees that all the following (but not previous) operations are ordered after the acquire operation, and a release guarantees that all the previous operations are ordered before the release. The memory ordering constraints in the release consistency model can also be easily expressed in Tardis using timestamps.
RENEWAL REDUCTION TECHNIQUES
As discussed in Section 2, a major drawback of the original Tardis protocol is the renewal problem. Renew messages are sent when loading an expired cacheline in the L1 and the load can commit only after the renew response comes back from the LLC. This incurs extra latency and network traffic in the system. In this section, we discuss techniques to reduce the number of unnecessary renew messages.
Livelock Detection
One disadvantage of using physiological time to order memory operations is that propagating a store to other cores may take an arbitrarily long time. This is because the writer does not push the update to a sharing core, which then may not pull the latest value if it keeps reading the stale version. In the worst case if a core spins on a stale cacheline with a small lts (or pts in SC), it can never see the latest update and livelock occurs (Listing 3). This spinning behavior is very commonly used for communication in parallel programs. 
Baseline: Periodic Self Increment
The original Tardis protocol solves the livelock problem by self-incrementing the pts (or lts in TSO) periodically, which forces the logical time in each core to move forward. For a spinning core (e.g., core 0 in Listing 3), the lts will increase and eventually become greater than the rts of the cacheline being spun on, at which point the line expires and the latest value will be loaded. Frequently self-incrementing the lts causes two performance issues. First, the cachelines in the private cache may frequently expire, generating renewals. For programs without spinning, these renewals are unnecessary and incur network traffic and latency overhead. Second, for cachelines that have an rts much larger than the current lts of the core, it may take a significant amount of time before the lts increases to rts to expire the stale cacheline.
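The second issue can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the lts is incremented by 1 once every `period` spinning loads; the function name and the choice of measuring the period in loads are hypothetical.

```python
def loads_until_expiry(lts: int, rts: int, period: int) -> int:
    """Count spinning loads on a stale line until lts exceeds its rts.

    Assumes lts is self-incremented by 1 once every `period` loads.
    """
    loads = 0
    while lts <= rts:
        loads += 1
        if loads % period == 0:
            lts += 1  # periodic self increment
    return loads


print(loads_until_expiry(lts=0, rts=10, period=100))    # 1100
print(loads_until_expiry(lts=0, rts=1000, period=100))  # 100100
```

Under these assumptions the delay grows linearly with the gap between rts and lts. Shortening the period reduces the delay but makes every other shared line in the private cache expire sooner, which is the first issue.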
Livelock Detector
In this paper, we make a key observation that a message can be sent to check the freshness of a cacheline before the cacheline actually expires. We introduce a new type of message called check message to check the freshness of a cacheline in the LLC. Like a renew request, if the latest data is newer than the cached data, the latest cacheline is returned. If the cached data is already the latest version, however, a check response is returned without changing the rts of the cacheline in the LLC.
The check request can resolve both drawbacks of the self incrementing scheme. Since a core does not need to frequently increase its lts, the number of renewals can be significantly reduced. A check request can be sent when the lts is much smaller than rts, so a core does not need to wait for a long time for the cacheline to expire.
Generally, a check request should be sent when the program seems to livelock since it keeps loading stale cachelines. In practical programs, such a livelock is usually associated with variable spinning, where a thread spins on a variable that will be set by another thread. This spinning typically involves a small number of cachelines and therefore is easy to detect in hardware. We designed a small piece of hardware next to each core to detect livelock.

The structure of a livelock detector is shown in Fig. 2. It contains an Address History Buffer (AHB) and a threshold counter (thresh_count). The AHB is a circular buffer keeping track of the most recently accessed addresses. Each entry in AHB contains the address of a memory access, and an access_count which is the number of accesses to the address since it was loaded to AHB. When access_count becomes greater than the thresh_count, a check request is sent for this address (Algorithm 1). The value of thresh_count can be static or dynamic. We chose to use an adaptive threshold counter scheme (Algorithm 2) in order to minimize the number of unnecessary check messages.

The livelock detection algorithm (Algorithm 1) is executed when loading a shared cacheline. It is not called when accessing exclusive cachelines since no livelock can occur for those accesses. If the accessed address does not exist in AHB, a new entry is allocated. Since AHB is a circular buffer, this may evict an old entry from it. (We assume LRU replacement policy here but other replacement policies should work equally well.) If the accessed address does exist in AHB, then the access_count is incremented by 1. When the counter reaches thresh_count, a check request is sent and the access_count is reset. All access_counts are reset to 0 when the lts increases due to a memory access, because this indicates that the core is not livelocking and thus there is no need to send checks.
The counter thresh_count may be updated for each check response (Algorithm 2). If the checked address was updated by another core, then thresh_count is decreased to its minimal value, meaning that check requests should be sent more frequently since the data seems to be updated more frequently. Otherwise, if check_thresh consecutive check requests return without the data having changed, then thresh_count is doubled since it appears unnecessary to send check requests that often. Adaptively determining the value of thresh_count can reduce the number of unnecessary check requests if a thread needs to spin for a long time before the data is updated.
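A minimal sketch of this adaptive update rule is given below. The names are hypothetical; the default bounds (100 to 800) and check_thresh (10) mirror the configuration reported in the evaluation. Starting at the minimal value and clearing the run of unchanged responses after each doubling are assumptions of this sketch.

```python
class AdaptiveThreshold:
    """Sketch of the adaptive thresh_count update (hypothetical names)."""

    def __init__(self, min_thresh=100, max_thresh=800, check_thresh=10):
        self.min_thresh = min_thresh
        self.max_thresh = max_thresh
        self.check_thresh = check_thresh  # unchanged responses needed to double
        self.thresh_count = min_thresh    # assumption: start at the minimum
        self.unchanged_checks = 0         # consecutive "not changed" responses

    def on_check_response(self, data_changed):
        """Update thresh_count for one check response and return it."""
        if data_changed:
            # Data is being updated: check more often again.
            self.thresh_count = self.min_thresh
            self.unchanged_checks = 0
        else:
            self.unchanged_checks += 1
            if self.unchanged_checks >= self.check_thresh:
                # Checks look unnecessary: back off, up to the maximum.
                self.thresh_count = min(self.thresh_count * 2, self.max_thresh)
                self.unchanged_checks = 0  # assumption: restart the run
        return self.thresh_count


threshold = AdaptiveThreshold()
for _ in range(10):
    after_ten_unchanged = threshold.on_check_response(data_changed=False)
after_update = threshold.on_check_response(data_changed=True)
```

In this model, ten unchanged responses in a row double the threshold from 100 to 200, and a single response reporting changed data drops it back to 100.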
Note that the livelock detector can only detect spinning involving fewer than M (the number of entries in the AHB) memory reads, since the AHB can only remember the last M distinct accessed addresses. So in theory, the livelock detector cannot capture all possible livelocks, and self incrementing lts is still required to guarantee correctness. For practical programs, however, spinning typically involves only 1 or 2 memory addresses (e.g., the benchmarks evaluated in this paper), so the livelock detector is able to capture livelock in the vast majority of programs. We still self increment lts periodically, but the frequency can be much lower since most programs' correctness no longer relies on this mechanism.
Lease Prediction
During regular operation of Tardis, memory stores are the fundamental reason that the timestamps increase in the system. The amount by which sts increases is determined by the lease of the previous data version, because the wts of the cacheline after the store must be no less than the previous version's rts. Therefore, the lease of each cacheline is very important to the timestamp incrementing rate as well as the renew rate. The original Tardis protocol uses a static lease for every shared cacheline. In this section, we show that a static leasing policy may incur unnecessary renewals (Section 4.2.1). We then propose a dynamic leasing policy that can reduce the number of unnecessary renewals (Section 4.2.2).
Static Leasing vs. Dynamic Leasing
In the code snippet shown in Listing 4, both cores are running the same program. They both load addresses A and B and then store to address B. When the cachelines are loaded to L1 caches, they are all reserved with a lease L, assuming that the lease is static. When the store to address B is performed at both cores, both sts will jump ahead by at least L. At the end of the loop, the FENCE instruction increases lts to the value of sts. In the next iteration when both cores load address A again, the cacheline will expire in the L1 cache. This is because the lts of the core has already jumped ahead by at least L due to previous store to B and the fence, but the lease of cacheline A was only reserved by L. The end result is that in each iteration of the loop, lts and sts jump ahead by L and cacheline A has to be renewed in each core. All these renewals to cacheline A will be successful, and therefore unnecessary, since cacheline A has never been changed. Note that using a larger static L does not solve the problem. With a larger L, the sts jumps ahead further to perform the store and cacheline A will still expire.
Our solution to this problem is to use different leases for different addresses. Intuitively, we want to use large leases for read only or read intensive data, and small leases for write intensive data. In the example in Listing 4, if cacheline A has lease 100 and cacheline B has lease 10, then each store to B increases the sts and lts by only about 10. So it takes about 10 iterations before cacheline A has to be renewed again. Note that the renew rate of cacheline A is only a function of the ratio between these two leases. It does not depend on the absolute value of the leases.
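The effect of the lease ratio can be illustrated with a small single-core model of the loop. This is a simplified sketch under stated assumptions rather than the protocol itself: the function name is hypothetical, the other core is abstracted away by assuming cacheline B is fetched with a fresh lease in every iteration, and the store is assumed to be ordered one tick after B's previous lease (the +1 is an assumption of this sketch).

```python
def count_renewals(lease_a, lease_b, iterations):
    """Count renewals of cacheline A in a simplified model of the loop."""
    lts = sts = 0
    rts_a = lts + lease_a      # first load of A is a miss, not a renewal
    renewals = 0
    for _ in range(iterations):
        # Load A: it has expired if the core's lts has passed its lease.
        if lts > rts_a:
            renewals += 1
            rts_a = lts + lease_a
        # Load B: assumed to be fetched with a fresh lease each iteration,
        # because the other core wrote it in the meantime.
        rts_b = lts + lease_b
        # Store B: ordered after the previous lease (+1 is an assumption).
        sts = max(sts, rts_b + 1)
        # FENCE: lts catches up with sts.
        lts = max(lts, sts)
    return renewals


static_small = count_renewals(lease_a=8, lease_b=8, iterations=100)
static_large = count_renewals(lease_a=64, lease_b=64, iterations=100)
dynamic = count_renewals(lease_a=100, lease_b=10, iterations=100)
```

In this model, both static configurations renew A in every iteration after the first (99 renewals in 100 iterations), regardless of the absolute lease value, while the 100-versus-10 configuration renews A only 9 times.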
In a real system, it is non-trivial to decide what data should have a larger or smaller lease. In this paper, we will explore hardware only solutions and design a predictor deciding what the correct lease should be for each cacheline. It is also possible to decide the leases with software support. Such explorations are left for future work.
Lease Predictor
Our lease predictor is based on the observation that cachelines that are frequently renewed are more likely to be read intensive. Therefore, a cacheline should be reserved with a larger lease if it is renewed more frequently. The logic to determine the lease value is built in the LLC, since the LLC has the knowledge of all the renewals.
For each renew request from the L1 to the LLC, the last lease of the cacheline is also sent to the LLC and is received by the lease predictor. For an L1 miss, there is no last lease on the cacheline, so a minimal lease is sent. The lease predictor decides what lease should be returned to the requester based on the request lease (req_lease) and the predictor's current internal lease (cur_lease). The algorithm of our lease predictor is shown in Algorithm 3. For a write request, cur_lease is reset to the minimal lease value (min_lease), because the write indicates that the cacheline might be write intensive, and assigning a large lease to it may cause unnecessary renewals. For a normal read request, cur_lease is used for the requested cacheline. For a renew request, cur_lease is compared with the request lease (req_lease), which is the lease of the cacheline in its last read or renew request. If the two leases are different, then cur_lease is used for the cacheline. Otherwise, cur_lease is doubled since the cacheline seems to be renewed multiple times by the same core and is therefore likely to be read intensive. If cur_lease has already reached the maximal value (max_lease), then it no longer increases.
Algorithm 3: Lease Prediction Algorithm
The initial value of cur_lease is the minimal lease value (min_lease). This means that for a cacheline first loaded to the LLC, we always assume that it is write intensive. We made this design decision because incorrectly giving a large lease to a write intensive cacheline is much more expensive than giving a small lease to a read intensive cacheline. If a cacheline with a large lease is written, then a large number of cachelines in the core's L1 might expire due to the program timestamp jumping ahead. In contrast, if a read only cacheline is given a small lease, only this cacheline needs to send renewals later and other cachelines are not affected.
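The prediction rules described in this section can be sketched as follows. This is a minimal software model with hypothetical names, one instance per LLC cacheline; the default lease bounds of 8 and 64 are taken from the evaluated configuration.

```python
class LeasePredictor:
    """Sketch of the per-cacheline lease predictor (hypothetical names)."""

    def __init__(self, min_lease=8, max_lease=64):
        self.min_lease = min_lease
        self.max_lease = max_lease
        self.cur_lease = min_lease  # new lines are assumed write intensive

    def on_write(self):
        """A write suggests the line is write intensive: fall back to min."""
        self.cur_lease = self.min_lease

    def on_read(self):
        """A normal read request is served with the current lease."""
        return self.cur_lease

    def on_renew(self, req_lease):
        """A renew request carries the lease the requester last received."""
        if req_lease == self.cur_lease:
            # Renewed again at the same lease: likely read intensive,
            # so double the lease, saturating at max_lease.
            self.cur_lease = min(self.cur_lease * 2, self.max_lease)
        return self.cur_lease


predictor = LeasePredictor()
leases = [predictor.on_read()]
for req in (8, 16, 32, 64):
    leases.append(predictor.on_renew(req))
predictor.on_write()
lease_after_write = predictor.on_read()
```

Repeated renewals from the same core walk the lease through 8, 16, 32 and 64, where it saturates; a single write sends it back to 8.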
EVALUATIONS
In this section, we evaluate the performance of Tardis with TSO and the optimizations proposed in Section 4. 
Methodology

System Configuration
We use the Graphite [18] multicore simulator to model the Tardis coherence protocol. The hardware configuration is shown in Table 1. The configurations of the baseline Tardis, the livelock detector and the lease predictor are shown in Table 2.
For the baseline Tardis, we implemented all the optimizations in the original Tardis protocol, including speculative reads when a cacheline expires in the L1 cache and not incrementing sts for private writes [9]. The static lease always equals 8, and the lts (or pts in SC) self increments by 1 for every 100 memory accesses.
For the livelock detector, the address history buffer (AHB) by default contains 8 entries. The threshold counter can take values ranging from 100 to 800. The threshold counter is doubled if 10 consecutive checks respond that the data has not been changed.
The minimal lease is chosen to be 8 and the maximum lease is 64. There are 4 possible lease values (i.e., 8, 16, 32, 64) in the system.
Baselines
The following coherence protocols are implemented and evaluated for comparison.
Directory: Full-map MSI directory coherence protocol.
Base Tardis: Baseline Tardis without livelock detector and without lease predictor. The lts self increments by 1 for every 100 memory accesses.
Tardis + live: Baseline Tardis with livelock detector. The lts self increments by 1 for every 1000 memory accesses.
Tardis + live + lease: Tardis with both livelock detector and lease predictor.
In Tardis, the base-delta timestamp compression scheme [9] was implemented and each timestamp requires 20 bits of storage. So each cacheline requires 40 bits in total for wts and rts, regardless of the number of cores in the system. The full-map directory protocol, in contrast, requires an N-bit sharer list for each cacheline in the LLC for an N-core system.
Our experiments are executed over 20 benchmarks selected from Splash2 [19], PARSEC [20], sparse linear algebra [21] and OLTP database applications [22]. For sparse linear algebra, we evaluated sparse matrix multiplication (SPMV) and symmetric Gauss-Seidel smoother (SYMGS). Both are from the HPCG benchmark used for Top 500 supercomputer ranking. For OLTP database, we evaluated two benchmarks, YCSB and TPCC. All benchmarks are executed to completion.
Tardis TSO
Fig. 3 shows the speedup of directory coherence and Tardis running SC and TSO. All numbers are normalized to the directory protocol with SC. Tardis in this experiment has both livelock detection and lease prediction enabled. On both protocols, TSO performs better than SC since loads can bypass pending stores in the store buffer. On average, TSO improves the performance of the directory protocol by 2.9% (up to 12.3%) and Tardis by 3.0% (up to 12.3%). Tardis TSO outperforms directory TSO by 1% on average (up to 6.2%).
As discussed in Section 3, TSO can also reduce the renew rate in Tardis. We define renew rate as the ratio of the number of renew requests over the total number of LLC accesses. Fig. 4 shows the renew rate reduction of Tardis TSO compared to Tardis SC. TSO can reduce the renew rate because lts does not increase for memory stores. As a result, a cacheline that would expire in Tardis SC may be valid in Tardis TSO. On average, the renew rate is reduced from 15.7% to 12.4%, and the reduction can be up to 20× on certain benchmarks (e.g., OCEAN-C). This leads to a 1.5% (up to 5.5%) reduction in total network traffic. In the directory coherence protocol, however, TSO only improves performance and does not reduce traffic.
Although not shown in these figures, TSO can also significantly decrease the rate at which timestamps increase. This is because lts can stay behind sts, and therefore lts and sts may increase more slowly than pts does in Tardis SC. On average, the timestamp increment rate in Tardis TSO is only 47% of the rate in Tardis SC.
Livelock Detector and Lease Predictor
We evaluate the performance and hardware overhead of the livelock detector and lease predictor of Section 4.
Performance and Network Traffic
First, we see that CHOLESKY and SYMGS on baseline Tardis have much worse performance than on the directory-based protocol. Both benchmarks heavily use spinning to communicate between different threads. As a result, it may take a long time for the cacheline spun on to expire. The livelock detector can close the performance gap between baseline Tardis and baseline directory because a spinning core is able to observe the latest data much earlier. With both optimizations, the performance of Tardis can be improved by 15.8% on average. Most of the performance improvement comes from CHOLESKY and SYMGS, where the baseline Tardis performs very badly. For some benchmarks (e.g., OCEAN-NC, TPCC), Tardis with livelock detection slightly hurts performance compared to the baseline Tardis. This degradation is caused by the lower self increment rate of lts and not by livelock detection itself. A lower self increment rate reduces the number of renewals, and thus the LLC has less accurate LRU information for each cacheline. Therefore, more LLC misses are generated.
Fig. 6 shows the network traffic breakdown for the same four configurations as in Fig. 5. For each experiment, we show DRAM traffic, common traffic, renew traffic and invalidation traffic. Common traffic is the traffic in common for both directory coherence and Tardis, including shared, exclusive and write back memory requests and responses. Renew traffic is specific to Tardis and includes renew and check requests and responses. Invalidation traffic is specific to the directory-based protocol, including the invalidation requests to shared copies from the directory, as well as the messages sent between L1 and LLC when a shared cacheline is evicted. Tardis is able to remove all the invalidation traffic in a directory coherence protocol. However, the renew traffic adds extra overhead.
The baseline Tardis configuration incurs a large amount of renew traffic on certain benchmarks (e.g., WATER-SP, FMM, CHOLESKY and VOLREND). Some of the renew traffic is due to fast self incrementing lts (e.g., WATER-SP and CHOLESKY), which expires shared cachelines in the L1 cache. For these benchmarks, the livelock detection scheme can significantly reduce the self increment rate and therefore reduce the amount of renew traffic. On average, the livelock detection algorithm is able to reduce the total network traffic by 5.6% (up to 34.6%) compared to the baseline Tardis. This makes Tardis's network traffic 4.3% (up to 14.8%) lower than that of the baseline directory protocol.
For many benchmarks, shared cachelines expire because the lts jumps ahead due to a write (e.g., VOLREND), and renew messages are generated. Our lease prediction algorithm is able to reduce these unnecessary renewals by using a larger lease for read intensive cachelines. On top of the livelock detection optimization, lease prediction further reduces the total network traffic by 1.3% on average (up to 8.3%). With both livelock detection and lease prediction, Tardis can reduce the total network traffic by 5.6% compared to the baseline directory protocol.
Hardware Complexity
The hardware overhead for the livelock detector and lease predictor is very small. Each livelock detector contains 8 AHB entries and each entry requires an address and a counter. Assuming 48-bit address space and a counter size of 2 bytes, the detector only requires 64 bytes storage per core.
To implement the lease predictor, we need to store the current lease for each LLC and L1 cacheline. The lease is also transferred for each shared request and response. However, there is no need to store or transfer the whole lease value. Since a lease can only take one of 4 possible values, we can use 2 bits to encode a lease. As a result, the storage overhead is less than 0.4% in the LLC and in each L1.
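The storage figures above can be checked with a short calculation. This is a back-of-the-envelope sketch with hypothetical function names; the 64-byte cacheline size is an assumption (it is not stated in this section), and the overhead is computed against the data bits only, ignoring tags and other metadata.

```python
def detector_bytes(entries=8, addr_bits=48, counter_bytes=2):
    """Storage of one livelock detector: each entry holds address + counter."""
    return entries * (addr_bits // 8 + counter_bytes)


def lease_overhead(lease_values=4, cacheline_bytes=64):
    """Fraction of extra storage needed to encode the lease per cacheline.

    cacheline_bytes=64 is an assumption of this sketch.
    """
    lease_bits = (lease_values - 1).bit_length()  # 4 values fit in 2 bits
    return lease_bits / (cacheline_bytes * 8)


per_core_bytes = detector_bytes()       # 8 * (6 + 2) bytes
lease_fraction = lease_overhead()       # 2 bits over 512 data bits
```

Under these assumptions the detector needs 64 bytes per core, and the 2-bit lease adds about 0.39% per cacheline, consistent with the bound of less than 0.4% quoted above.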
Sensitivity Study
We run more experiments in this section to provide more insights into our optimization techniques.
Self Increment Rate
Fig. 7 shows the performance and network traffic of Tardis sweeping the self increment period with and without livelock detection (LL Detect). All numbers are normalized to a baseline directory protocol. The Base Tardis Self 100 corresponds to the baseline Tardis configuration and LL Detect Self 1000 is the default optimized Tardis configuration.
In WATER-SP, sweeping the self incrementing rate does not change the performance regardless of whether livelock detection is turned on or not. This is because WATER-SP does not have spinning and therefore renewals are always unnecessary. Having a large self increment period can significantly reduce the number of unnecessary renewals as well as the total network traffic.
In SYMGS, for Tardis without livelock detection, performance is very sensitive to the self increment period because SYMGS intensively uses spinning to communicate between threads. If self increment is less frequent, a thread waits longer to expire the stale data and learn the latest value, and thus performance degrades. With the livelock detector, however, check requests are sent when spinning (potential livelocks) is detected. So the latest value of a cacheline spun on can be returned much earlier. Regardless of the self increment period, Tardis with the livelock detector can always match the performance of the baseline directory protocol.
Address History Buffer Size
Fig. 8 sweeps the number of entries in the address history buffer in a livelock detector for CHOLESKY and SYMGS. According to the results, as long as the AHB size is no less than 2, performance does not change. This is because in practical programs, spinning only involves a very small number of distinct memory addresses. CHOLESKY only spins on two addresses and SYMGS only spins on one address. We used a buffer size of 8 as the default, but smaller buffers also work.
Livelock Threshold Counter
When an address in the AHB has been accessed thresh_count times, a check request is sent. So with a larger thresh_count, checks are sent after spinning for a longer time. As a result, performance may degrade due to this extra latency. On the other hand, a larger thresh_count can also reduce the total number of check messages and reduce network traffic. In practice, the thresh_count should be chosen to balance this tradeoff. We chose 100 as the default threshold counter.
Scalability
Finally, Fig. 10 shows the performance and network traffic of all benchmarks running on a 256-core system. On average, Tardis with our optimizations outperforms the baseline Tardis protocol by 8.4% and reduces the network traffic by 6.3%. Compared to the baseline directory protocol, the optimized Tardis outperforms by 3.4% and reduces the network traffic by 6.1%.
Note that at 256 cores, the performance improvement and traffic reduction of Tardis compared to the baseline directory protocol are even greater than in the 64-core case. This indicates that Tardis not only has better scalability in terms of storage as core count increases, but also scales better in terms of performance and traffic.
RELATED WORK
Memory coherence is an important issue in shared memory systems with private storage in each core or processor. It has been widely studied and implemented in multicore processors [3, 4] , multi-socket systems [23, 24] and distributed shared memory systems [25, 26] . Traditional directory or snoopy based coherence protocols enforce the global memory order in a consistency model using physical time order. They need an invalidation mechanism to guarantee correctness, and therefore either require non-scalable storage overhead (e.g., directory-based protocols) or broadcasting in the network (e.g., snoopy-based protocols).
Numerous previous works have tried to improve the scalability of directory coherence protocols [27, 7, 28, 8, 29, 5, 6] . Most of these works focused on better ways to organize the directory structure to improve scalability. Compared to the full-map directory protocol, these optimizations usually hurt performance and increase the design and verification complexity.
Previous works have also proposed to use timestamps in coherence protocol design. Some of these works require a globally synchronized clock and still assume that memory consistency is enforced in physical time order [30, 31]. Other papers have proposed to use lazy coherence to eliminate the need for an invalidation mechanism. These schemes are based on the insight that memory orders need to be enforced only at synchronization boundaries for release consistency [32, 33, 34] or for TSO [35]. Unlike Tardis [9], these schemes cannot support sequential consistency, and their implementation is also more complex.
This paper generalizes the Tardis protocol beyond the sequential consistency model and also improves Tardis' performance. Similar to our livelock detector, TSO-CC [35] has a mechanism to prevent livelock caused by indefinitely loading stale data in a private cache. However, their solution requires a counter for each cacheline and is therefore more expensive than our livelock detector.
CONCLUSION
In this paper, several optimization techniques have been applied to Tardis, a very scalable physiological time based cache coherence protocol. The total store order (TSO) memory model is supported in Tardis and two optimizations (livelock detection and lease prediction) are added. Compared to a baseline directory protocol implementing TSO, Tardis with our optimizations improves performance by 1% and reduces network traffic by 5.6% on a 64-core system. At 256 cores, the performance improvement goes up to 3.4% and the traffic reduction goes up to 6.1%. Other advantages of Tardis are the reduction in storage and the simplicity of the protocol. On this set of benchmarks, we conclude that Tardis is better than a full-map directory protocol in terms of performance, energy and storage while being simpler.
