A new memory coherence protocol, TARDIS, is proposed. TARDIS uses timestamp counters representing logical as opposed to physical time to order memory operations and enforce memory consistency models in any type of shared memory system. Compared to the widely adopted directory coherence protocol, TARDIS is simpler, only requires O(log N) storage per cache block for an N-core system rather than the O(N) sharer information required by conventional directory protocols, and integrates better with some system optimizations. On average, TARDIS achieves similar performance to directory protocols on a wide range of benchmarks.
INTRODUCTION
Shared memory systems are ubiquitous in parallel computing. Examples include multi-core and multi-socket processors, and distributed shared memory systems (DSM). The correctness of these systems is defined by the memory consistency model which specifies the legitimate interleaving of operations from different nodes (e.g., cores or processors). In practice, the consistency model is maintained through the coherence protocol. For a shared memory system, the coherence protocol is the key component to ensure performance and scalability.
When the data can be cached in the local memory of a node, most shared memory systems today adopt directory-based coherence protocols [29, 3]. Examples of coherence between multi-socket systems are Intel's QPI [34] and AMD's HyperTransport [2], and examples of distributed shared memory systems are IVY [18] and Treadmarks [12]. These protocols keep a list of nodes (sharers) caching each piece of data and send invalidations to the sharers before the data is modified by some node. Well-known challenges in directory coherence protocols are latency and scalability: waiting for all invalidation requests to be acknowledged may take a long time, and storing the sharer information or supporting broadcasting does not scale well as the number of nodes increases.
Previous works have proposed more scalable directory coherence protocols [1, 16, 4, 20, 10, 5, 6, 13, 24]. However, these schemes typically add complexity to the base protocol, and sacrifice performance or require broadcasting or require software support. Other works have proposed alternative protocols to directory for better scalability [22, 23, 9, 19, 27]. These schemes usually achieve worse performance than directory protocols or only apply to particular memory consistency models.

Figure 1: Architecture of a shared memory multicore processor.
We propose a new coherence protocol, TARDIS, which is simpler and more scalable than directory coherence, but has equivalent performance and generality with respect to different consistency models. TARDIS directly expresses the memory consistency model by explicitly enforcing the global memory order using timestamp counters that represent logical as opposed to physical time; unlike prior timestamp coherence schemes (e.g., [19, 27]), it does not require a globally synchronized clock. In TARDIS, only the timestamps and the owner ID need to be stored for each address, for an O(log N) cost where N is the number of processors or cores; the O(N) sharer information is not required. TARDIS is compatible with commonly used memory consistency models and integrates more easily with remote word access operations than directory protocols.
We evaluated TARDIS in the context of multicore processors. Our experiments showed that TARDIS achieves similar or better performance than its directory counterpart over a wide range of benchmarks. Due to its simplicity and excellent performance, we believe TARDIS is a competitive alternative to directory coherence for massive-core and DSM systems.
We provide background in Section 2, describe the basic TARDIS protocol in Section 3, optimizations to the basic protocol in Section 4, and extensions in Section 5. We evaluate TARDIS in Section 6, discuss related work in Section 7 and conclude the paper in Section 8.
BACKGROUND
In this section, we provide some background on shared memory systems as well as memory consistency and coherence. Fig. 1 shows a tiled architecture of a shared memory multicore system. Each tile of the system has a core, a private cache and a slice of shared last level cache (LLC). Tiles are connected through links and routers. For a load or a store request, the private cache is first probed; if a cache miss occurs, then a request is sent to the home core of the cacheline; if the access misses again, then a request is sent to the DRAM.

arXiv:1501.04504v1 [cs.DC] 19 Jan 2015
Shared Memory System
Although other shared memory systems (e.g., multi-socket systems or DSM) may have a different organization from a multicore processor, they all have a physically distributed but logically shared memory space. For the sake of clarity, in the rest of the paper, we use a multicore processor as a case study. But our discussion applies equally well to other types of shared memory systems.
Sequential Consistency
A memory consistency model defines the correctness of a shared memory system. Specifically, it defines the legitimate behavior of memory loads and stores. Although a large number of consistency models exist, we will focus on Sequential Consistency due to its simplicity.
Sequential Consistency was first proposed and formalized by Lamport [17]. A parallel program is sequentially consistent if "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program". If we use <p and <m to denote program order and global memory order respectively, sequential consistency requires the following two rules to hold [31]:
Rule 1: X <p Y ⇒ X <m Y
Rule 2: Value of L(a) = Value of Max<m {S(a) | S(a) <m L(a)}
where L(a) is a load to address a and S(a) is a store to address a; the Max<m operator selects the most recent operation in the global memory order.
Rule 1 claims that if an operation X (a load or a store) is before another operation Y according to the program order of any core, X must be before Y in the global memory order. Rule 2 says that a load to a cacheline should return the value of the most recent store to that cacheline with respect to the global memory order.
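As a rough illustration of the two rules, the Python sketch below enumerates every global memory order that respects program order (Rule 1) for a small two-core program and lets each load return the most recent earlier store (Rule 2). The program, the tuple encoding and the helper names are assumptions chosen for illustration; they are not taken from the paper.

```python
from itertools import permutations

# Hypothetical two-core program; both addresses start at 0.
# Each operation is (core, kind, address, value), listed in program order.
PROGRAM = [
    (0, "S", "A", 1), (0, "L", "B", None),
    (1, "S", "B", 1), (1, "L", "A", None),
]

def respects_program_order(order):
    """Rule 1: operations of one core keep their program order globally."""
    for core in {op[0] for op in PROGRAM}:
        mine = [i for i in order if PROGRAM[i][0] == core]
        if mine != sorted(mine):
            return False
    return True

def run(order):
    """Rule 2: each load returns the latest earlier store in the global order."""
    memory = {"A": 0, "B": 0}
    loads = {}
    for i in order:
        core, kind, addr, value = PROGRAM[i]
        if kind == "S":
            memory[addr] = value
        else:
            loads[(core, addr)] = memory[addr]
    # (value of B seen by core 0, value of A seen by core 1)
    return loads[(0, "B")], loads[(1, "A")]

def sc_outcomes():
    """All load results reachable under sequential consistency."""
    return {run(o) for o in permutations(range(len(PROGRAM)))
            if respects_program_order(o)}

print(sorted(sc_outcomes()))  # [(0, 1), (1, 0), (1, 1)]
```

Under these assumptions the outcome (0, 0) never appears: no global order that satisfies both rules lets both loads miss both stores.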
Directory Coherence
In practice, each core in the system has a private local cache ( Fig. 1 ). In this case, a memory coherence protocol is used to guarantee the consistency model.
Virtually all shared memory systems today use some variant of directory memory coherence which can satisfy the two rules of sequential consistency. The directory is a software or hardware structure tracking how the cachelines are shared or owned by different cores. In directory coherence, the second rule of sequential consistency is achieved through the invalidation mechanism; when a core writes to a cacheline that is shared by other cores, all the shared copies need to be invalidated before the write can happen. Future reads to that cacheline have to send requests to the directory which returns the value of the last write. This mechanism essentially guarantees that reads that happen after the last write with respect to physical time can only observe the value of the last write (the second rule of sequential consistency).
The directory needs to keep the sharer information of each cacheline in order to correctly deliver the invalidations. If the system has N cores, this requires O(N) storage per cacheline, which does not scale well as the number of cores increases. An alternative solution to reduce the storage overhead is to use broadcasting if the number of sharers exceeds the capacity of the directory [16]. However, broadcasting itself is not a very scalable mechanism.
TARDIS
We present a new coherence protocol, TARDIS, which only requires O(log N) storage per cacheline and requires neither broadcasting support nor a globally synchronized clock across different cores or caches. TARDIS works for all types of distributed shared memory systems and is compatible with any memory consistency model. We present the theoretical basis of TARDIS in Section 3.1 and its implementation in Section 3.2 and Section 3.3. TARDIS is compared to directory coherence in Section 3.4 and the backwards compatibility of TARDIS is discussed in Section 3.5.
Timestamp Ordering
In directory coherence (cf. Section 2.3), the global memory order (<m) is enforced through the physical time order, i.e., if X <m Y where one of X and Y is a store, then the operation X happens physically before Y. In TARDIS, we break the correlation between the global memory order and the physical time order and explicitly express the global memory order using logical timestamps. Each memory operation in TARDIS has an associated timestamp indicating the global order of the memory operation. So X <m Y is equivalent to X <ts Y or ts(X) < ts(Y). The two rules of SC can then be interpreted using the following timestamp notation:
Rule 1 (version 1): X <p Y ⇒ X <ts Y.
Rule 2 (version 1): Value of L(a) = Value of Max<ts {S(a) | S(a) <ts L(a)}.

The rules above require each memory operation to have a unique timestamp, which is actually unnecessary. Operations can have the same timestamp in the following three cases. First, we can define S(a) =ts L(a) ⇒ S(a) <p L(a). Second, operations with no order constraint can have the same timestamp (e.g., independent operations to different addresses, or concurrent loads to the same address with no store in between). Finally, operations from the same core can have the same timestamp as long as the core enforces memory order and other timestamp constraints are not violated. With this observation, the two rules above can be transformed into the following optimized rules.

Rule 1 (version 2): X <p Y ⇒ X ≤ts Y and X <m Y.
Rule 2 (version 2): Value of L(a) = Value of Max<ts {S(a) | S(a) ≤ts L(a)}.
Rule 3 (version 2): S1(a) ≠ts S2(a) if S1 and S2 are from different cores.
In the new rules, the timestamps of the operations from the same core only need to be monotonically non-decreasing, and the core enforces memory order for those operations with the same timestamp. This is not hard in practice since only a single core is involved and existing processors already enforce this rule. Since operations can now have the same timestamp, a third rule is required to make sure that stores from different cores to the same address have different timestamps; the memory order of stores from the same core is already enforced by the first rule.
Henceforth in the paper we will focus on the optimized rules (version 2) above.
TARDIS without Private Cache
In TARDIS, timestamps are maintained as logical counters. Each core keeps a pts as the timestamp of the last operation from this core. Each cacheline keeps a read timestamp (rts) and a write timestamp (wts). The rts equals the largest timestamp among all the loads of the cacheline thus far and the wts equals the timestamp of the latest store to the cacheline. The pts should not be confused with the processor clock: it does not self-increment and is only incremented when accessing data. The directory structure is replaced with a timestamp manager. Any load or store request to the LLC should go to the timestamp manager.
For illustrative purposes, we first show the TARDIS protocol assuming no private cache and that all the data fits in the LLC. Each cacheline only has a single copy on the chip and all the memory requests go to that unique copy. The protocol with private caches and DRAM will be presented later in Section 3.3. Table 1 shows the rules of timestamp management. Each memory request contains the pts of the core it comes from. The final value of pts at the requesting core is the assigned timestamp of the current operation. For a load request, the timestamp manager returns the value of the last store. According to Rule 2, the load timestamp must be equal to or greater than the timestamp of the last store which is wts. If pts < wts then the pts must bump up to wts, meaning that the load happens at logical timestamp wts. If pts ≥ wts, Rule 2 is already satisfied and thus pts does not need to increase. If pts > rts, then rts bumps up to pts since the current load request has a larger timestamp.
For a store request, the latest load of the cacheline (at rts) did not observe the value of the current store. According to Rule 2, the timestamp of the current store must be greater than the rts of the cacheline (the timestamp of the last load). If pts ≤ rts, the pts should bump up to rts + 1; wts and rts should also bump up to the final pts. If pts > rts, then Rule 2 is already satisfied so we only need to bump wts and rts to the pts. Note that Rule 3 is also satisfied in this process since each store at the LLC will increase the wts.
In all possible cases, Rule 2 and Rule 3 hold throughout the protocol. Rule 1 also holds since the pts of a particular processor can never decrease. As a result, the TARDIS protocol in Table 1 guarantees sequential consistency.
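The load and store rules described above can be sketched in a few lines of Python. This is a minimal model that assumes no private caches and a single copy of each cacheline, as in this section; the `Line` class and the function names are hypothetical and only mirror the prose, not an actual hardware interface.

```python
from dataclasses import dataclass

@dataclass
class Line:
    """One cacheline held by the timestamp manager (hypothetical layout)."""
    value: int = 0
    wts: int = 0   # timestamp of the latest store
    rts: int = 0   # largest timestamp of any load so far

def load(pts, line):
    """Handle a load carrying the core's pts; returns (new_pts, value)."""
    pts = max(pts, line.wts)        # the load is not ordered before the last store
    line.rts = max(line.rts, pts)   # record the latest load timestamp
    return pts, line.value

def store(pts, line, value):
    """Handle a store carrying the core's pts; returns the new pts."""
    pts = max(pts, line.rts + 1)    # the store is ordered after the latest load
    line.wts = line.rts = pts       # both timestamps move to the store's timestamp
    line.value = value
    return pts

line = Line()
p0 = store(0, line, 1)      # p0 == 1, line.wts == line.rts == 1
p1, v = load(0, line)       # p1 == 1, v == 1
p2, _ = load(5, line)       # p2 == 5, line.rts == 5
p3 = store(2, line, 7)      # p3 == 6, because the line was read at timestamp 5
```

Each function returns the final pts of the requesting core, which is the timestamp assigned to the operation. The pts only ever moves forward, which is what Rule 1 relies on.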
TARDIS with Private Cache
With a private cache in each core, the TARDIS protocol introduced in Section 3.2 largely remains the same. Like the previous protocol, each cacheline in the private cache also has a read timestamp (rts) and a write timestamp (wts). However, two extra mechanisms are added on top of the previous protocol.
Timestamp Reservation: Unlike the previous protocol where a load happens at a particular timestamp, timestamp reservation allows a load to reserve the cacheline in the private cache for a range of timestamps. The end timestamp of the reservation is stored in rts. The length of the reservation is the lease of the cacheline in the private cache. The cacheline can be read until the timestamp expires (pts > rts). If the cacheline being accessed has already expired, then a request is sent to the timestamp manager to extend the lease.
Exclusive Ownership: In order to cache data for writing, TARDIS allows a cacheline to be exclusively owned by a core just like in the directory coherence protocol. In the timestamp manager, the cacheline is in exclusive state and the owner of the cacheline is also tracked. (This requires log(N) bits of storage.) The data can be accessed freely by the owner core as long as it is in the exclusive state, and the timestamps are properly updated with each access. If another core later accesses the same cacheline, a write back request (the owner continues to cache the line in shared state) or a flush request (the owner invalidates the line) is sent to the owner, which sends back the latest version of the data.
The state transition and the timestamp management of TARDIS with private cache are shown in Table 2 and Table 3 . Table 2 shows the state transition at the private cache and Table 3 shows the state transition at the shared timestamp manager. Table 4 shows the network message types used in the TARDIS protocol. In this section, we only consider the simplest TARDIS protocol.
In the protocol, each cacheline (denoted as D) has a read timestamp (D.rts) and a write timestamp (D.wts). Some network messages (denoted as M ) also have timestamps associated with them. Although it is possible to compress the messages such that each message only has two timestamp fields, we do not apply such optimizations here for clarity.
Most of Table 2 and Table 3 is just a simple extension of Table 1 and should be easy to understand. We now discuss different cases of the TARDIS protocol shown in both tables.

State Transition in Private Cache

Load to Private Cache (Table 2, columns 1, 4, 5): A load to the private cache is considered a hit if the cacheline is in exclusive state, or is in shared state and has not expired (pts ≤ rts). Otherwise, a SH REQ is sent to the timestamp manager to reserve the data or to extend the existing lease. The SH REQ carries the current wts of the cacheline, indicating the version being cached.
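The hit condition for a load in the private cache can be sketched as follows. The `State` enum and the function names are hypothetical. The lease formula `max(rts, pts + lease)` is the one used in the example program later in this section, and treating it as the general reservation rule is an assumption of this sketch.

```python
from enum import Enum

class State(Enum):
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2

def load_hits(state, pts, rts):
    """A load hits if the line is exclusive, or shared and not yet expired."""
    if state is State.EXCLUSIVE:
        return True
    return state is State.SHARED and pts <= rts

def lease_end(rts, pts, lease):
    """End timestamp of a reservation made or extended at pts (assumed rule)."""
    return max(rts, pts + lease)

print(load_hits(State.SHARED, pts=4, rts=6))     # True: lease still valid
print(load_hits(State.SHARED, pts=7, rts=6))     # False: expired, send SH REQ
print(load_hits(State.EXCLUSIVE, pts=7, rts=6))  # True: owner may always read
print(lease_end(rts=6, pts=7, lease=5))          # 12
```

Note that an expired shared line is not discarded by this check; the request that follows only asks the timestamp manager to extend the lease or return a newer version.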
Store to Private Cache (column 2, 4, 5): A store to the private cache can only happen if the cacheline is exclusively owned by the core. As in directory coherence, if the core does not own the line, an EX REQ is sent to the timestamp manager to obtain exclusive ownership. The rts and wts of the private data are updated accordingly for each store.
Eviction (column 3): Evicting shared cachelines does not require sending any network message; the cacheline can simply be invalidated. Evicting exclusive cachelines is the same as in directory coherence: the data is returned to the timestamp manager and the line is invalidated.

Flush or Write Back: Exclusive cachelines in the private cache may receive flush or write back requests from the timestamp manager when the cacheline is evicted from the LLC or requested by other cores. A flush is handled similarly to an eviction, where the data is returned and the line is invalidated. For a write back request, the data is also returned but the line becomes shared.
State Transition in Timestamp Manager
Shared Request to Timestamp Manager (Table 3, column 1): If the cacheline is invalid, it must be loaded from DRAM first. If it is exclusively owned by another core, then a write back request is sent to the owner. When the cacheline is eventually in the Shared state, it is reserved for a range of timestamps by setting the rts to be the end timestamp of the reservation, and the line can be read from wts to rts.

If the wts of the request equals the wts of the cacheline in the timestamp manager, the data in the private cache must be the same as the data in the timestamp manager. So the data does not need to be returned and a RENEW REP is sent back to the requester. Otherwise, a SH REP is sent back with the data.

Exclusive Request to Timestamp Manager (column 2): An exclusive request can be either an exclusive load or an exclusive store. Similar to directory coherence, if the cacheline is invalid, it should be first loaded from DRAM; if the line is exclusively owned by another core, a flush request should be sent to the owner.
If the requested cacheline is in shared state, however, no invalidation messages need to be sent. The timestamp manager can immediately give exclusive ownership to the requesting core. However, the exclusive ownership is only valid from the current rts onwards. Other cores can still read their local copies of the cacheline if they have not expired. This does not violate sequential consistency since the read operations in the sharing cores are ordered before the write operation in the timestamp order. If the cacheline expires in the sharing cores, they will send requests to renew the line, at which point they get the latest version of the data.
If the wts of the request equals the wts of the cacheline in the timestamp manager, then the data is not returned and an UPGRADE REP is sent to the requester.
Evictions (column 3): Evicting a cacheline in exclusive state is the same as in directory coherence, i.e., a flush request is sent to the owner before the line is invalidated. For shared cachelines, however, no invalidation messages need to be sent. Sharing cores can still read their local copies until they expire. And all memory operations can be properly ordered.
DRAM (column 3, 4): TARDIS only stores timestamps on chip but not in DRAM. This is achieved through the max timestamp (mts), which is stored per timestamp manager. mts indicates the maximal read timestamp of all the cachelines mapped to this timestamp manager but evicted to DRAM. For each cacheline evicted from the LLC, mts is updated to be Max(rts, mts). When a cacheline is loaded from DRAM, both its wts and rts are assigned to be mts. This guarantees that accesses to previously cached data with the same address are ordered before the accesses to the cacheline just loaded from DRAM. This ordering matters when a cacheline is evicted from the LLC but is still cached in some core's private cache.
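The mts bookkeeping can be sketched as below. The class and method names are hypothetical, and the model tracks only timestamps (no data and no coherence state); it is meant to show the two update rules, not a full timestamp manager.

```python
class TimestampManagerSlice:
    """Timestamp bookkeeping for one LLC slice (hypothetical sketch)."""

    def __init__(self):
        self.mts = 0      # max rts over all lines evicted to DRAM so far
        self.lines = {}   # address -> (wts, rts) for lines present in the LLC

    def evict(self, addr):
        """Drop a line from the LLC; only mts remembers its timestamps."""
        _wts, rts = self.lines.pop(addr)
        self.mts = max(self.mts, rts)

    def fill_from_dram(self, addr):
        """Load a line from DRAM; it restarts at wts = rts = mts."""
        self.lines[addr] = (self.mts, self.mts)
        return self.lines[addr]

tm = TimestampManagerSlice()
tm.lines["A"] = (3, 9)         # A was last written at 3 and leased until 9
tm.lines["B"] = (2, 4)
tm.evict("A")                  # mts becomes 9
tm.evict("B")                  # mts stays 9
print(tm.fill_from_dram("A"))  # (9, 9)
```

Because the reloaded line starts at mts, a later store to it gets a timestamp above 9 and is therefore ordered after any read of the old private copy, whose lease ended at 9.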
Flush or write back response (column 5): Finally, the flush response and write back response are handled in the same way as in directory coherence. Note that when a cacheline is exclusively owned by a core, only the owner has the latest rts and wts; the rts and wts in the timestamp manager are invalid and the bits can be reused to store the ID of the owner core.
An Example Program
We use an example to show how TARDIS works with a parallel program. Fig. 2 shows how the simple program in Listing 1 runs with the TARDIS protocol. In the example, we assume a lease of 5 and that the instructions from Core 0 are executed before the instructions in Core 1.
Listing 1: Example Program
Step 1: The store to A misses in Core 0's private cache and an EX REQ is sent to the timestamp manager. The store operation should happen at pts = Max(pts, A.rts + 1) = 1 and the A.rts and A.wts in the private cache should also be 1. The timestamp manager marks A as exclusively owned by Core 0.
Step 2: The load of B misses in Core 0's private cache. After Step 1, Core 0's pts becomes 1. So the reservation end timestamp should be Max(B.rts, pts + lease) = 6.
Step 3: The store to B misses in Core 1's private cache. At the timestamp manager, the exclusive ownership of B is immediately given to Core 1 at pts = rts + 1 = 7. Note that two different versions of B exist in the private caches of Core 0 and Core 1 (marked in red circles). In Core 0, B = 0 but is only valid when 0 < timestamp ≤ 6; in Core 1, B = 1 and is only valid when timestamp > 6. This does not violate sequential consistency since the loads of B at Core 0 will be logically ordered before the loads of B at Core 1, even if they may happen the other way around with respect to the physical time.
Step 4 : Finally the load of A misses in Core 1's private cache. The timestamp manager sends a WB REQ to the owner (Core 0) which updates its own timestamps and writes back the data. Both cores will have the same data with the same range of valid timestamps.
With TARDIS on sequential consistency, it is impossible for the example program above to output 0 for both A and B. In out-of-order execution, if both loads are scheduled before the stores, then at least at one core the timestamp of the store will be greater than the timestamp of the load, thus violating the first rule of sequential consistency. TARDIS can easily detect this violation, reissue the load request with the new pts and get the correct value.
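The timestamp arithmetic of Steps 1 to 3 can be replayed with a short Python sketch. It assumes that every pts, wts and rts starts at 0, which is consistent with the values quoted in the steps, and it tracks only timestamps, not data, cache states or messages. The function and variable names are hypothetical.

```python
LEASE = 5  # lease length assumed in the example

def walk_through(lease=LEASE):
    """Replay the timestamps of Steps 1-3; returns (pts0, B lease end, pts1)."""
    # Assumption: all timestamps start at 0.
    pts = {0: 0, 1: 0}
    A = {"wts": 0, "rts": 0}
    B = {"wts": 0, "rts": 0}

    # Step 1: Core 0 stores to A at Max(pts, A.rts + 1).
    pts[0] = max(pts[0], A["rts"] + 1)
    A["wts"] = A["rts"] = pts[0]

    # Step 2: Core 0 loads B and reserves it until Max(B.rts, pts + lease).
    pts[0] = max(pts[0], B["wts"])
    B["rts"] = max(B["rts"], pts[0] + lease)
    b_lease_end_core0 = B["rts"]

    # Step 3: Core 1 stores to B just after the end of the reservation.
    pts[1] = max(pts[1], B["rts"] + 1)

    return pts[0], b_lease_end_core0, pts[1]

print(walk_through())  # (1, 6, 7)
```

The result matches the text: Core 0 may keep reading the old B up to timestamp 6, while Core 1's store to B takes effect at timestamp 7.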
TARDIS vs. Directory Coherence
In this section, we compare the TARDIS protocol with the directory coherence protocol. The advantages and disadvantages of TARDIS are summarized in Table 5 .
Protocol Messages
In Table 2 and Table 3, the advantages and disadvantages of TARDIS compared to directory are shaded in light green and light red, respectively. Both schemes have similar performance in the other state transitions (the white cells).

Invalidation: In a directory coherence protocol, when the directory receives an exclusive request to a Shared cacheline, the directory sends invalidations to all the cores sharing the cacheline and waits for acknowledgements. This usually incurs long latency which may hurt performance. In TARDIS, however, no invalidation happens in this case (Section 3.3) and the exclusive request can immediately return with exclusive ownership. The timestamp manager only needs to make sure that the returned wts is greater than the rts, which is the largest timestamp the current line can possibly be loaded with by any core. This makes TARDIS much simpler to implement and reason about, but does cause issues with backward compatibility (cf. Section 3.5).
Eviction: In directory coherence, when a shared cacheline is evicted from the private cache, a message is sent to the directory because the sharer information needs to be tracked there. Similarly, when a cacheline is evicted from the LLC, all the copies in the private caches should be invalidated. Though these invalidation messages may not be on the critical path of execution, they still generate network traffic. In TARDIS, correctness does not require maintaining sharer information and thus no such invalidations are required. Similarly, when a cacheline is evicted from the LLC, the copies in the private caches can still exist and be accessed.
Data Renewal: In directory coherence, a load hit only requires that the cacheline exists in the private cache. In TARDIS, however, a cacheline in the private cache may have expired and cannot be accessed. In this case, a renew request is sent to the timestamp manager, which incurs extra latency and network traffic. If the renewal is successful, then the overhead is a round-trip message with a single header flit (no data). The latency may be hidden by an Out-of-Order (OoO) processor [30]. In Section 6, we show that data renewal is not an issue in practice.
Scalability
A key advantage of TARDIS over directory coherence is scalability. TARDIS only requires the storage of timestamps for each cacheline (O(1)) and the owner ID for each LLC cacheline (O(log N), where N is the number of cores). In practice, the same hardware bits can be used for both timestamps and owner ID in the LLC, because when the owner ID needs to be stored, the cacheline is exclusively owned by a core and the timestamp manager does not need to maintain the timestamps.
In contrast, a directory coherence protocol usually maintains the list of cores sharing a cacheline, which requires O(N) storage overhead. It is certainly possible to store partial sharer information and rely on broadcasting to invalidate sharers (e.g., [16]), but broadcasting is an expensive operation and does not scale well to high core counts.
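To make the O(N) versus O(log N) comparison concrete, the sketch below counts per-cacheline bits for two encodings. It assumes the directory uses a full-map sharer vector with one presence bit per core, which is one common way to store sharer information; this encoding and the function names are assumptions for illustration. Timestamp bits are left out since their width does not depend on N.

```python
def full_map_sharer_bits(n_cores):
    """Bits for a full-map sharer vector: one presence bit per core."""
    return n_cores

def owner_id_bits(n_cores):
    """Bits needed to name a single owner among n_cores cores."""
    return max(1, (n_cores - 1).bit_length())

for n in (16, 64, 256, 1024):
    print(n, full_map_sharer_bits(n), owner_id_bits(n))
# 16 16 4
# 64 64 6
# 256 256 8
# 1024 1024 10
```

Going from 64 to 1024 cores multiplies the sharer vector by 16 but adds only 4 bits to the owner ID.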
Simplicity
Another advantage of TARDIS is its conceptual simplicity and elegance. TARDIS is directly derived from the definition of sequential consistency and the timestamps explicitly express the global memory order. This makes it easier to argue the correctness of the protocol. As will be shown in Section 5, TARDIS is compatible with various system designs and optimization techniques, but there are issues with legacy code as we describe below.
Backward Compatibility
Even though TARDIS strictly follows sequential consistency, it may not be compatible with all legacy code. For example, directory coherence guarantees that a write is quickly observed by all the cores; because the write invalidates all the shared copies in private caches, reads from other cores will miss in the private cache and thus load the latest version of the data from the LLC. This mechanism is not the requirement of sequential consistency but some programs rely on it. For example, test, test-and-set and producer-consumer are usually implemented on top of this mechanism where a core spins on a variable which is modified by another core.
In TARDIS, however, a core can still read the old data even though another core has modified it, as long as the cacheline does not expire. If the core spins on a variable in the private cache, the pts does not increase and the cacheline with the old data is always valid. In the pathological case, the core spins forever and the updated variable is never loaded into the private cache, and the result is livelock.
The solution to the livelock problem is, however, straightforward. To be backward compatible, we only need to make sure that an update is eventually observed by the following reads. The pts in each core can be occasionally incremented so that the old version eventually expires. The self increment can be periodic or some heuristics can be used; for example, if performance suffers or livelock is detected, then the self increment should be more frequent.
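The effect of the self increment on a spinning core can be sketched as follows. The function models a core that repeatedly reads a shared line from its private cache and bumps its pts by one every `period` reads; the name, the unit increment and the `period` parameter are assumptions for illustration, since the text leaves the exact heuristic open.

```python
def spins_until_renew(pts, rts, period):
    """Count local reads before the lease expires (pts > rts).

    period is the number of reads between self increments of pts;
    None models the basic protocol without self increment.
    Returns None when the lease never expires (livelock).
    """
    if period is None:
        return 0 if pts > rts else None   # pts never moves while spinning
    spins = 0
    while pts <= rts:                     # line still valid: read hits locally
        spins += 1
        if spins % period == 0:
            pts += 1                      # periodic self increment
    return spins

print(spins_until_renew(pts=1, rts=6, period=None))  # None: spins forever
print(spins_until_renew(pts=1, rts=6, period=100))   # 600
print(spins_until_renew(pts=1, rts=6, period=10))    # 60
```

A shorter period makes the update visible sooner at the cost of more renew requests, which is the trade-off the heuristics mentioned above would tune.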
Compared to directory coherence, the above solution may incur more messages in the network and higher latency since the consumer may send multiple renew requests to the timestamp manager; but this solution is functionally correct. In general, since TARDIS has quite different underlying structure from directory coherence, the optimal communication strategy should also be different. In TARDIS, the best way to implement mutual exclusion or producer-consumer is likely not through local variable spinning. Extra hardware support may be required for more responsive communication in TARDIS. The exploration of these extra mechanisms is deferred to future work.
TARDIS OPTIMIZATION
We introduce some optimizations in the TARDIS protocol in this section.
Speculative Execution
As discussed in Section 3.4, the main disadvantage of TARDIS compared to directory coherence is the renew request. In a pathological case, the pts in a core may rapidly increase since some cachelines are frequently read-write shared by different cores. Meanwhile, the read-only cachelines will frequently expire and a large number of renew requests are generated. Renew requests incur both latency and bandwidth overhead. Observe, however, that most renew requests will successfully extend the lease and the renew response does not transfer the data. This significantly reduces the network traffic of renewals. More importantly, this means that the expired cacheline is actually valid and we could have used the value without stalling the pipeline of the core.

Based on this observation, we propose the use of speculation to hide renew latency. When the core reads a cacheline which has expired in the private cache, instead of stalling and waiting for the renew response, the core can use the current value and speculatively execute instructions. If the renew fails and the latest cacheline is returned, the core should roll back to the state prior to the speculation and rerun the instructions. Otherwise, the core can simply commit the instructions executed during the speculation.
For most processors, the outstanding renew request does not cause the pipeline to stall, so successful renews do not hurt performance. Speculation failure does incur a large performance overhead since we have to roll back and rerun the speculatively executed instructions. However, if the same instruction sequence is executed in directory coherence, the expired cacheline should not be in the private cache in the first place; the update from another core should have already invalidated this cacheline and a cache miss should happen. As a result, in both TARDIS and directory coherence, the value of the load should be returned at the same time, incurring the same latency and network traffic. There is still a small difference between the two protocols in handling this case: directory coherence still allows the core to do out-of-order execution without the value of the load so that some progress is made, while TARDIS will roll back all the forward progress and rerun all the instructions. So TARDIS may still perform worse than directory coherence in this case, but the performance gap should not be very large.
Timestamp Compression
In the basic TARDIS protocol discussed in Section 3.3, all the cachelines need to store both the rts and wts on the chip. In this section, we show how to compress the timestamp storage such that only one timestamp is stored for each cacheline in the private cache. We will discuss the compression for shared and exclusive cachelines respectively.
For a shared cacheline, the wts can actually be eliminated; the core only needs to check the rts to figure out whether the cacheline has expired or not. In the basic TARDIS protocol, the wts is also used for version checking in the timestamp manager (if two copies have the same wts, they must have the same value). However, just the rts in the private cacheline can also do the version checking with the cacheline in the LLC; if the rts is equal to or greater than the wts of the LLC copy, the private cacheline must have the same value as the cacheline in the LLC.
For an exclusive cacheline, the rts can be eliminated. We only need to keep track of the last modification timestamp. When the cacheline is flushed or written back to the LLC, it is conservative to assume that the last read timestamp is the current pts. The actual rts may be smaller than pts but assuming a higher value does not affect correctness in this case.
Timestamp Rollover
With real hardware, the timestamps in the system may eventually roll over, leading to incorrect behavior. The simple and obvious solution is to use timestamps wide enough (e.g., 64 bits) that they will never roll over in practice. However, this incurs a significant storage overhead.
A more elegant solution is to use a shorter timestamp counter and handle timestamp rollover with a special mechanism. Specifically, we propose a two-phase timestamp management policy.
The system has two phases. In phase 0, timestamps with MSB = 0 (Most Significant Bit) are ordered before the timestamps with MSB = 1; and vice versa in phase 1. When the system runs in phase 0, it gradually cleans up the timestamps with MSB = 1 by bumping them up to the smallest timestamp with MSB = 0 (i.e., ts = 0...0). This cleanup process can be piggybacked on the existing coherence traffic. For example, in phase 0, for all responses from the LLC, if the rts or wts has MSB = 1, it can be rounded up to 0...0. For correctness, before the system transitions to phase 1, all the timestamps should have MSB = 0. The coherence protocol should be able to clean up most of the timestamps before the next transition. If it cannot, a global stall may be needed to clean up the rest of the timestamps. In practice, such global stalls should be very rare (if they happen at all) given reasonably large timestamps.
TARDIS EXTENSIBILITY
To this point, we have only discussed the simplest TARDIS protocol in the context of sequential consistency and simple core models. Though the TARDIS protocol is derived from sequential consistency, it is perfectly compatible with other consistency models as well. The simplicity of TARDIS also makes it possible to apply other processor optimizations which do not work well in directory coherence. In this section, we will show how to apply TARDIS to different consistency models and support remote memory access in TARDIS.
Relaxed Consistency
In commercial multicore processors, relaxed consistency models are more commonly used than sequential consistency for performance reasons. For example, Intel x86 [26] and SPARC [31] processors use Total Store Order (TSO) which is weaker than sequential consistency. Even weaker memory models are used in other processors (Alpha [28], PowerPC [7], ARM [25], etc.). These consistency models relax certain constraints in sequential consistency and thus provide more freedom to the underlying hardware to reorder memory operations. They insert memory fences when certain memory order needs to be enforced.
In TSO, the Store→Load constraint is relaxed. The first rule of SC only needs to hold if X and Y are both loads or both stores, or if X is a load and Y is a store. There is no constraint when X is a store and Y is a load. This relaxation allows a load to bypass the stores in a processor's write buffer and thus exploit more instruction-level parallelism.
In TARDIS, the constraint can be easily expressed using timestamps. Instead of having a single pts, each core can have two timestamps, one for loads (lts) and one for stores (sts). The core guarantees that both the lts and sts are monotonically increasing in the program order. It also guarantees that the sts of a store is larger than the latest lts in the program order (Load→Store). The memory fences in TSO can also be easily implemented. A fence can be considered as a synchronization between sts and lts where the lts bumps up to the latest value of sts. Different from directory coherence, the core does not need to stall at a fence; it only needs to update the timestamps.
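The timestamp rules for TSO can be sketched as follows. This is a minimal illustration, not the evaluated implementation: the class `TSOCore` and its method names are hypothetical, and the per-line rules (a load is ordered no earlier than the line's wts, a store strictly after the line's rts) are assumed to carry over from the basic protocol.

```python
class TSOCore:
    """Per-core timestamps for TSO: lts for loads, sts for stores."""

    def __init__(self):
        self.lts = 0  # timestamp of the latest load in program order
        self.sts = 0  # timestamp of the latest store in program order

    def load(self, line_wts: int) -> int:
        # Load->Load order: lts never decreases. The load is not
        # compared against sts, which is the Store->Load relaxation.
        self.lts = max(self.lts, line_wts)
        return self.lts

    def store(self, line_rts: int) -> int:
        # Store->Store order: sts never decreases.
        # Load->Store order: sts is larger than the latest lts.
        # Assumed basic-protocol rule: the store comes after line_rts.
        self.sts = max(self.sts, self.lts + 1, line_rts + 1)
        return self.sts

    def fence(self) -> None:
        # A fence synchronizes the two counters: lts bumps up to sts.
        # No stall is modeled; only the timestamps change.
        self.lts = max(self.lts, self.sts)
```

For example, after `store(10)` returns 11, a following `load(5)` is still ordered at timestamp 5, i.e., before the earlier store in the global memory order. If a `fence()` is placed between the two, the same load is ordered at timestamp 11 instead.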
Similar to TSO, other consistency models can also be implemented through timestamps. The timestamp management policy can simply follow the definition of the consistency model. And the fences are synchronization points of timestamps.
Remote Word Access
Traditionally, a core always loads a cacheline into the private cache before accessing the data. However, it is also possible to access the data remotely without caching it. Remote word access has been studied in the context of locality-aware directory coherence [15] . Remote atomic operation has been implemented on Tilera processors [11, 8] . Allowing data accesses or computations to happen remotely can reduce the coherence messages and thus improve performance [33] .
However, it is not easy to maintain the performance gain of these remote operations with directory coherence under TSO or sequential consistency. For a remote load operation (which might be part of a remote atomic operation), it is not very easy to determine its global memory order since it is hard to know the physical time when the load operation actually happens. As a result, integration with directory coherence is possible but fairly involved [14] .
In TARDIS, however, memory operations are ordered through timestamps. It is very easy to determine the memory order for a remote access since it is simply the timestamp of the operation. In TARDIS, multiple remote accesses can be issued in parallel and the order can be checked after they return. If any load violates the memory order, it can be reissued with the updated timestamp information.
Figure 3: Performance normalized to MSI (normalized throughput of DirCC, TARDIS without speculation, and TARDIS).
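One possible way to check the order of parallel remote loads after they return is sketched below. Everything here is an assumption made for illustration: the toy `ToyLLC` class, the lease-extension rule in `remote_load`, and the rule that a returned value is valid for logical timestamps between its wts and rts. The sketch assumes sequential consistency, so loads must receive non-decreasing timestamps in program order.

```python
LEASE = 10  # lease length; the evaluation uses a fixed lease of 10

class ToyLLC:
    """Hypothetical LLC holding addr -> [value, wts, rts]."""

    def __init__(self):
        self.lines = {}

    def remote_load(self, addr, req_ts):
        value, wts, rts = self.lines[addr]
        # Assumed rule: extend the lease so that it covers the
        # requester's timestamp.
        rts = max(rts, max(wts, req_ts) + LEASE)
        self.lines[addr][2] = rts
        return value, wts, rts

def issue_parallel(llc, addrs, pts):
    # Issue all remote loads with the same starting timestamp,
    # without waiting for earlier responses.
    responses = [llc.remote_load(a, pts) for a in addrs]
    results = []
    ts = pts
    # Check the responses in program order once they have returned.
    for addr, (value, wts, rts) in zip(addrs, responses):
        ts = max(ts, wts)
        if ts > rts:
            # The lease ended before the timestamp this load must
            # take: reissue with the updated timestamp.
            value, wts, rts = llc.remote_load(addr, ts)
            ts = max(ts, wts)
        results.append((value, ts))
    return results, ts
```

In this sketch a load is reissued only when an earlier load in program order returned a version with a large wts, which pushes the later load's timestamp past the lease it originally obtained.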
EVALUATION
In this section, we evaluate the performance of TARDIS in the context of multicore processors. The conclusions derived in this section also apply to other shared memory systems.
Methodology
We use the Graphite [21] multicore simulator for our experiments. The default hardware parameters are listed in Table 6. We use the simplest directory protocol, MSI (DirCC), as the baseline. This is fair since the TARDIS protocol we evaluate here is also the simplest version. More complicated directory schemes require the Exclusive (E) or Owned (O) state; both states can also be added to TARDIS. We leave more sophisticated TARDIS protocols to future work.
TARDIS with and without speculative execution is implemented and evaluated. We implemented timestamp compression in the private cache (L1) and assume that the timestamps are long enough and do not rollover for the duration of the simulation. The lease of a cacheline is always 10 regardless of the memory access pattern.
All the experiments in this section follow the sequential consistency memory model. Numerous Splash-2 [32] benchmarks are used for performance evaluation.
Performance Study
The throughput of MSI and TARDIS is shown in Fig. 3. For TARDIS, we show the performance with speculation turned on and off. For most benchmarks, TARDIS achieves the same or better performance compared to the directory baseline. On average, TARDIS with and without speculation is 3.1% and 2.3% better than directory coherence, respectively.
For most experiments, TARDIS without speculation does not perform much worse than the baseline directory. This is because the out-of-order cores can execute useful instructions while the renew request is outstanding.
TARDIS with speculation in general performs better than TARDIS without speculation. If the requested cacheline exists in the private cache, then with speculation the data is immediately ready, while without speculation the data is only ready after the renew response comes back. In some benchmarks (e.g., ocean-contiguous and ocean-noncontiguous), TARDIS performs significantly better than MSI. The performance difference mainly comes from the difference in LLC eviction policy. In MSI, the cacheline with the least number of sharers is evicted first; this eviction policy performs better than a simple LRU policy in practice. In TARDIS, we always evict the cacheline with the smallest rts, which is the least recently used cacheline in the global memory order but not in the physical time order. This can lead to more efficient LLC eviction.
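The two LLC victim-selection rules compared above can be written as a short sketch. The dictionary layout (keys `tag`, `rts`, and `sharers`) and the function names are hypothetical and serve only to contrast the two policies.

```python
def tardis_victim(cache_set):
    # TARDIS: evict the line with the smallest rts, i.e., the least
    # recently used line in the global (logical) memory order.
    return min(cache_set, key=lambda line: line["rts"])

def msi_victim(cache_set):
    # MSI baseline: evict the line with the fewest sharers.
    return min(cache_set, key=lambda line: len(line["sharers"]))

# A hypothetical 3-way set in which the two policies disagree.
example_set = [
    {"tag": "A", "rts": 40, "sharers": {0}},
    {"tag": "B", "rts": 12, "sharers": {1, 2, 3}},
    {"tag": "C", "rts": 95, "sharers": {4, 5}},
]
```

On `example_set`, the TARDIS rule picks line B (smallest rts) while the sharer-count rule picks line A (one sharer).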
In raytrace, however, TARDIS performs significantly worse than the baseline directory coherence. Raytrace is implemented based on a task queue model with all the threads frequently communicating with each other, much more so than in other benchmarks. We believe that the synchronization primitives were optimized for directory coherence and not for the TARDIS protocol. As a result, TARDIS can only achieve suboptimal performance.
Network Traffic
We compare the network traffic in TARDIS with directory coherence. Fig. 4a shows the total network traffic normalized to the baseline directory coherence. Fig. 4b shows the ratio of renew requests and miss speculations to the total number of LLC requests in TARDIS with speculation.
For some benchmarks, a large portion of LLC requests are renew requests (e.g., radiosity, volrend). These benchmarks also incur the most network traffic in TARDIS compared to DirCC. However, even if a majority of the network traffic comes from renew requests (more than 60% in radiosity), the increase in total network traffic is quite small (less than 10% for radiosity). As discussed in Section 3.4, successful renewals only require a single flit round trip message without data payload, which is much cheaper than a normal LLC request where the data is included in the response. A failed renewal needs to include the data in the response but the traffic will also be incurred in the baseline DirCC for the same instruction sequence; this is because the cacheline would have been invalidated by another core and result in a private cache miss (Section 4.1). As a result, the traffic overhead of renewals in TARDIS is quite small.
One way to further reduce the renewals in TARDIS is to dynamically determine a cacheline's lease based on the data access pattern. For example, read only cachelines may have a longer lease and frequently read and written shared cachelines may have a shorter lease. We will explore this optimization in future work.
For most benchmarks, TARDIS has less network traffic than DirCC. There are at least two reasons for this. First, as mentioned in Section 6.2, TARDIS has a better LLC replacement policy than DirCC in our implementations, so there is less traffic to and from memory controllers. Second, TARDIS does not send invalidations to sharers for exclusive requests or evictions. This saves network traffic for invalidation-intensive benchmarks.
RELATED WORK
We discuss some related works on timestamp based coherence (Section 7.1) and scalable directory coherence (Section 7.2).
Timestamp based coherence
To the best of our knowledge, none of the existing timestamp based coherence protocols applies to all consistency models and achieves performance as good as TARDIS. In all of these protocols, the notion of a timestamp is tightly coupled with physical time. To our knowledge, TARDIS is the first algorithm to use logical timestamps that are decoupled from physical time.
Using timestamps for coherence has been explored in both software [22] and hardware [23] . Recently, TSO-CC [9] proposed a hardware coherence protocol based on timestamps. However, it does not work for sequential consistency and leads to bad performance in certain pathological cases.
In the literature we studied, Library Cache Coherence (LCC) [19] is the closest algorithm to TARDIS. Different from TARDIS, LCC used the physical time as timestamps. In LCC, performance is bad for certain pathological access patterns. For example, a write to a shared variable in LCC needs to wait for all the shared copies to expire which may take a long time. This is much more expensive than TARDIS which only needs to update a counter.
Singh et al. use a variant of LCC on GPUs with optimizations to mitigate the timestamp expiration problem [27] . However, the algorithm only works efficiently for release consistency and is not general to all consistency models in all shared memory systems.
Scalable directory coherence
Some previous works have proposed techniques to make directory coherence more scalable. Limited directory schemes (e.g., [1] ) only track a small number of sharers and rely on broadcasting [16] or invalidations when the number of sharers exceeds a threshold; this either incurs performance overhead or requires broadcasting which is not a scalable mechanism.
Other schemes have proposed to store the sharer information in a chain [4] or hierarchical structures [20] . With large number of sharers, this requires several directory accesses to traverse the list or the tree which hurts performance.
Previous works have also proposed to use coarse vectors [10], sparse directory [10], software support [5] or disciplined programs [6] for scalable coherence. Recently, some cache coherence protocols have been proposed for 1000-core processors [13, 24]. These schemes are based on directory coherence and require O(N) directory storage per cacheline.
CONCLUSION
We proposed a new memory coherence protocol, TARDIS, in this paper. One key observation behind TARDIS is that global memory order only needs to be enforced logically, while previous schemes enforce it using physical time. TARDIS can be directly derived from the consistency model and is simpler than directory coherence. Simplicity also makes TARDIS more compatible with different system configurations and optimizations. Importantly for massive-scale shared memory systems, TARDIS is very scalable since it only requires O(log N) storage per cache block and no broadcasting.
