Cache coherence is an integral part of shared-memory systems but is also widely considered to be one of the most complex parts of such systems. Much prior work has addressed this complexity and the verification techniques to prove the correctness of hardware coherence. Given the new multicore era with an increasing number of cores, there is a renewed debate about whether the complexity of hardware coherence has been tamed or whether it should be abandoned in favor of software coherence. This article revisits the complexity of hardware cache coherence by verifying a publicly available, state-of-the-art implementation of the widely used MESI protocol, using the Murϕ model checking tool. To our surprise, we found six bugs in this protocol, most of which were hard to analyze and took several days to fix. To compare the complexity, we also verified the recently proposed DeNovo protocol, which exploits disciplined software programming models. We found three relatively easy-to-fix bugs in this less mature protocol. After fixing these bugs, our verification experiments showed that, compared to DeNovo, MESI had 15X more reachable states leading to a 20X increase in verification (model checking) time. Although we were eventually successful in verifying the protocols, the tool required making several simplifying assumptions (e.g., two cores, one address).
Our results have several implications: (1) they indicate that hardware coherence protocols remain complex; (2) they reinforce the need for protocol designers to embrace formal verification tools to demonstrate correctness of new protocols and extensions; (3) they reinforce the need for formal verification tools that are both scalable and usable by non-experts; and (4) they show that a system based on hardware-software co-design can offer a simpler approach for cache coherence, thus reducing the overall verification effort and allowing verification of more detailed models and protocol extensions that are otherwise limited by computing resources.
INTRODUCTION
Shared memory continues to be one of the most widely used parallel programming models due to the advantages offered by a global address space. Today's multicores directly support shared memory and form the heart of a variety of computing systems, ranging from smartphones to data centers to the largest supercomputers. One of the main challenges of shared memory hardware is to keep caches coherent, and this is usually addressed by hardware cache coherence protocols. Traditionally, a major problem with such protocols has been their extreme complexity, and implementing bug-free protocols has been a challenge [Abts et al. 2003]. Increasing on-chip core counts in manycores exacerbates this problem because it demands moving to variants of directory-based protocols. Methods like trace-driven or other simulation-based testing are insufficient for ensuring the correctness of such complex protocols since they often do not hit the difficult corner cases. Formal verification techniques have therefore been proposed and deployed to prove the correctness of these protocols [Abts et al. 2000; Burckhardt et al. 2005; Gjessing et al. 1989; McMillan and J. 1991; Nanda and Bhuyan 1992; Park and Dill 1995; Pong et al. 1998].
Over the past several decades, there has been much experience accumulated in both designing cache coherence protocols and building formal methods based tools to verify them. Nevertheless, as the number of cores on a chip grows, the power, performance, and area inefficiencies of hardware cache coherence (e.g., directory storage, indirection through the directory, invalidation and acknowledgment messages, false sharing, cache line based granularities, and so on) are potential obstacles to scalability. To reduce these inefficiencies, researchers continue to propose enhancements to basic coherence protocols; these enhancements typically come at the cost of even further complexity [Zebchuk et al. 2009; Zhao et al. 2013].
There is a debate in the community about whether we have tamed the complexity of hardware cache coherence enough to continue building increasingly larger hardware cache coherent systems or whether we must abandon hardware coherence in favor of other alternatives. Thus, on the one hand, companies like Tilera have built hardware cache coherent systems with as many as 64 cores, making the resulting familiar programming model a key differentiator for their product line. A recent position paper makes the case that hardware coherence is here to stay [Martin et al. 2012]. Other researchers continue to extend current protocols in new ways that make them even more complex [Aisopos and Peh 2011]. On the other hand, Intel has recently built a 48-core Single-Chip Cloud Computer (SCC) [IntelSCC 2009] that abandons hardware cache coherence, replacing it with software managed caches and their attendant programming complexity. Academic and industry researchers have also proposed systems that eliminate dependence on hardware coherence in various ways [Borkar 2011; Choi et al. 2011; Kelm et al. 2009; Lin et al. 2012].
In this article, we revisit the complexity of hardware cache coherence by asking the following questions. Has our experience with hardware cache coherence in the past few decades enabled us to conquer its complexity? If not, what are the implications for future system design?
To answer these questions, ideally, we would use an industry-strength cache coherence protocol specification and quantitative metrics for its complexity. Given the proprietary nature of industry protocols, we instead use a state-of-the-art, mature, publicly available protocol, the MESI protocol implemented in the Wisconsin GEMS simulation suite (version 2.1.1).¹ GEMS was released in 2005 and remains a widely used simulation package in the computer architecture community. Unfortunately, however, there are no currently accepted quantitative metrics to measure complexity. Instead, we report on our experiences verifying the GEMS MESI protocol using the Murϕ model checking tool (version 3.1, slightly modified to exploit 64-bit machines) [Dill et al. 1992; Dill 1993, 1996]. Although more advanced verification techniques exist, we chose Murϕ for its easy-to-use interface and robustness. Murϕ has also been the tool of choice for many hardware cache-related studies [Abts et al. 2000; Burckhardt et al. 2005; Park and Dill 1995; Pong et al. 1998; Zhang et al. 2010].
Our results from applying Murϕ to the GEMS MESI protocol were surprising. We found six bugs in the protocol, even though it is a mature protocol used by a large number of architecture researchers. More significantly, some of these bugs involved subtle races and took several days to debug and fix. We contacted the developers of the GEMS simulation team with our bug findings in 2011. They had seen one of the six bugs, bug #4 in Section 4.1, but were surprised by the other bugs. Some of these bugs were also present at that time in the GEM5 simulator, an extension to the GEMS simulator that incorporates the M5 CPU core simulator. After we showed our fixes, the GEMS group fixed the bugs and released new patches. These fixes needed the addition of multiple new state transitions and extra buffer space for stalling requests in the protocol. After fixing the protocol, we were able to successfully run Murϕ, but the tool required making several simplifying assumptions typical of such tools today (e.g., two cores, single cache line per core, and one address). With these assumptions, the verification process explored 1,257,500 reachable states and took 173 seconds using the default breadth-first search option of Murϕ. We ran out of memory on our machine (with 32GB of RAM) when we tried to model a system containing two memory addresses.
Our conclusion from this experience is that hardware cache coherence protocols remain difficult to design, understand, and prove correct. Although formal verification techniques have been embraced by industry, their use is not yet common practice for architecture researchers. Our results reinforce the need to use formal verification tools to assess the correctness and complexity of coherence protocols. Conversely, our work also motivates the need for verification tools that are scalable to larger systems and usable by non-experts.
An extreme solution to the hardware complexity problem is represented by the Intel SCC [IntelSCC 2009]. The SCC eliminates the hardware complexity of cache coherence but imposes significant programming complexity due to the requirement of managing the caches in software. Recently, we proposed an alternate method, called DeNovo, that is based on a tighter hardware/software co-design approach [Choi et al. 2011]. For easier programmability, this approach espouses disciplined shared-memory programming languages that eliminate data races by design, provide determinism-by-default, and provide safe nondeterminism only when explicitly requested. With a guarantee of data-race-freedom from software, DeNovo is able to avoid races in hardware and the resulting transient states and their complexity. The simplicity of the protocol, coupled with the disciplined software, provides performance and power advantages to DeNovo as well; furthermore, various enhancements (optimizations) can be incorporated without adding new states or transitions [Choi et al. 2011]. We report our experiences with using Murϕ to verify the DeNovo protocol and compare them with the MESI experiences.
To verify DeNovo, we first extended the canonical coherence protocol modeling technique in Murϕ to model the software constraints that DeNovo hardware expects. In particular, this extension is able to guarantee the data-race-free property and a disciplined parallel phase behavior. Using this extension, we verified the base DeNovo protocol. Despite DeNovo's immaturity, we found only three bugs in the implementation. Furthermore, these bugs were simple to fix and turned out to be mistakes in translating the high level description of the protocol into the implementation (i.e., their solutions were already present in the internal high level description of the protocol). After fixing the bugs, we found that, compared to MESI, Murϕ had to explore 15X fewer reachable states for DeNovo, taking 20X less time to fully model check DeNovo. It is difficult to define a metric to quantify the relative verification complexity of coherence protocols; nevertheless, our results demonstrate that hardware-software co-designed approaches like DeNovo can lead to much simpler protocols than conventional hardware cache coherence (while providing an easy programming model, extensibility, and competitive or better performance [Choi et al. 2011]).²
Overall, this article shows that despite the accumulated experience in designing hardware cache coherence, these protocols remain complex. It reinforces the need to use formal verification by protocol designers, motivating scalable and easy-to-use tools for non-experts. Finally, the article motivates hardware-software co-designed protocols that are much simpler to design and verify than pure hardware coherence and that exploit simpler programming models.
¹ We chose MESI over a more complicated MOESI from GEMS because the GEMS web site says that the CMP version of the latter has bugs. Although we specifically study the MESI protocol in this article, we believe our insights apply to other protocols, such as MOESI, as well.
COMPLEXITY OF CACHE COHERENCE
Cache coherence is usually defined as a mechanism that ensures writes to the same memory location appear in the same order to all cores. In this section, we discuss two directory based protocols, the MESI protocol, which is based on a state-of-the-art implementation obtained from the publicly available GEMS simulator, and the recent DeNovo protocol [Choi et al. 2011], which is based on our own implementation. Without loss of generality, we assume a multicore system with n cores, private L1 caches, a shared L2 cache, and a general (nonbus, unordered) interconnect on chip.
MESI
MESI, also known as the Illinois protocol [Papamarcos and Patel 1984], stands for Modified (locally modified and no other cache has a copy), Exclusive (unmodified and no other cache has a copy), Shared (unmodified and some other caches may have a copy), and Invalid. Compared with the MSI protocol, the Exclusive state has the added advantage of avoiding invalidation traffic on write hits. For scalability, we assume a directory protocol [Lenoski et al. 1992]. Given our shared (inclusive) L2 cache-based multicore, we assume a directory entry per L2 cache line, referred to as an in-cache directory [Censier and Feautrier 1978]. We use L2 and directory interchangeably. Figure 1 shows the simple textbook state transition diagram for an L1 cache with the MESI protocol. The L2 cache also has four (textbook) states: L1 Modified (modified in a local L1), L2 Modified (modified at L2 and not present in any L1), Shared (valid data at L2 and present in one or more L1s), and Invalid. It also has a dirty bit (set on receiving a writeback from L1) that indicates whether data is dirty or not. When in Shared state, the L2/directory contains the up-to-date copy of the cache block data along with a list of sharers. On a read miss request, the directory services the request if it has the up-to-date copy, or else it forwards the request to the core that has the exclusive or modified copy. On a write miss or upgrade request, the directory sends invalidation requests to all the sharers (if any). If a request misses in the L2, the block is fetched from the main memory.
² The paper proposing DeNovo [Choi et al. 2011] showed that the basic DeNovo protocol is performance-competitive with MESI, and its simplicity enables optimizations that significantly improved performance and network traffic (and thereby energy). That paper focused on performance, but as evidence of DeNovo's simplicity, it also briefly summarized some of the key results of this article. That paper only gave the number of bugs found in each protocol, the number of states explored, and the verification times. It did not describe the nature of the bugs in MESI or DeNovo, the fixes involved, or the modification to standard Murϕ modeling required to verify a hardware/software co-designed protocol like DeNovo.
In reality, this seemingly simple protocol is a lot more complex. The hardware implementation of the protocol has many transient states in addition to the four states just described. These transient states lead to various subtle races and are the root cause of complexity in the protocol. We now illustrate the need for transient states with an example. Figure 2(a) shows a code snippet with two parallel phases accessing a shared variable A.³ In the first phase, cores P2 through Pn read the shared variable A, and, in the second phase, core P1 writes to A. In this example, we focus mainly on the state transitions related to this write by core P1.
Figure 2(b) shows the timeline of the state transitions at both the individual L1s and the L2. Figure 2(c) and (d) show the state transition tables for L1 and L2, respectively, for the states encountered in this example. The names of the states and the events are taken directly from the GEMS implementation. At the beginning of the second phase, cores P2 through Pn are in Shared state and they are recorded in the sharer list at the directory. On receiving the write request, L1 P1 issues a GETX request to L2 and transitions to the first transient state, IM, where it awaits the data response from L2. L2, on receiving the GETX request, sends the data response (including the number of Acks to expect) to L1 P1; sends invalidation requests to all the sharers (L1 P2 through L1 Pn); and then transitions to a transient state, SS_MB, where it awaits an unblock message from the original requestor indicating the completion of the request. L1 P1, on receiving the data response from L2, transitions to the second transient state, SM, where it waits for all the Ack messages from the sharers. Every sharer, on receiving the invalidation message from L2, transitions to the Invalid state and responds directly to the requestor, L1 P1, with an Ack message. When L1 P1 receives the last Ack message (Ack_all event), it transitions to the Modified state and unblocks L2 by sending an Exclusive_Unblock message. Other cases (e.g., some of the Acks arriving at L1 P1 before it receives the data response from L2) are covered in the state transition tables.
37:6 R. Komuravelli et al.
This example illustrates the need for transient states and the additional complexities introduced by them in the MESI protocol. The larger the number of transient states, the more complex the protocol becomes. In the GEMS implementation of the MESI protocol, there are 7 transient states in L1 and 14 transient states in L2. Optimizations to the described baseline protocol usually incur additional transient states.
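The requestor side of the write miss described above can be sketched as a small state machine. This is only an illustrative model, not the GEMS code: the class `RequestorL1`, its method names, and the signed Ack counter used to tolerate Acks that overtake the data response are assumptions made for this sketch; only the state and message names (IM, SM, GETX, Ack, Exclusive_Unblock) come from the text.

```python
from enum import Enum, auto


class L1State(Enum):
    I = auto()    # Invalid
    IM = auto()   # transient: GETX issued, waiting for the data response from L2
    SM = auto()   # transient: data received, waiting for the remaining Acks
    M = auto()    # Modified


class RequestorL1:
    """Hypothetical model of L1 P1 servicing the write in the example."""

    def __init__(self):
        self.state = L1State.I
        self.pending_acks = 0   # Acks still expected; negative if Acks arrive early
        self.sent = []          # messages sent to L2, in order

    def store(self):
        # Write miss: issue GETX and enter the first transient state.
        assert self.state is L1State.I
        self.sent.append("GETX")
        self.state = L1State.IM

    def on_data(self, ack_count):
        # The data response carries the number of Acks to expect.
        assert self.state is L1State.IM
        self.pending_acks += ack_count
        self.state = L1State.SM
        self._finish_if_done()

    def on_ack(self):
        # On an unordered network an Ack may arrive before the data response.
        assert self.state in (L1State.IM, L1State.SM)
        self.pending_acks -= 1
        self._finish_if_done()

    def _finish_if_done(self):
        # The last Ack completes the write and unblocks the directory.
        if self.state is L1State.SM and self.pending_acks == 0:
            self.state = L1State.M
            self.sent.append("Exclusive_Unblock")
```

Even this reduced model needs two transient states and an ordering-insensitive counter for a single write, which gives a feel for why each additional race in the full protocol tends to add states and transitions.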
DeNovo
One of the main difficulties in writing and debugging parallel programs is data races. These software data races are also responsible for races in hardware with conventional hardware cache coherence protocols. These hardware races are directly responsible for most transient states in hardware protocols, and they form the root cause of much of the complexity in such protocols.
There has been much software work on disciplined shared-memory programming models [Anderson et al. 2008; Gustafsson 2009; Berger et al. 2009; Blumofe et al. 1995; Budimlic et al. 2009; Ghuloum et al. 2007; Bocchino et al. 2009; Olszewski et al. 2009; Kulkarni et al. 2007; Allen et al. 2009] to eliminate or alleviate the impact of data races, with the goal of providing better programmability, debuggability, and concurrency safety of parallel programs. Along with addressing the data race problem, much of this work also provides strong safety properties, such as determinism and various disciplined forms of nondeterminism, to simplify parallel programming. DeNovo is a hardware architecture that exploits the properties of such disciplined software programming models to additionally simplify hardware while also providing performance, power, and scalability benefits [Choi et al. 2011].
2.2.1. Software Model. DeNovo currently uses Deterministic Parallel Java (DPJ) [Bocchino et al. 2009] as an example of a disciplined programming language to drive its design. DPJ is an extension to Java that provides properties like structured parallel control, data-race-freedom, and explicit specification of the effects of shared memory accesses. In a DPJ program, the programmer partitions the heap into named "regions" and annotates methods with summaries of read and write operations on these regions ("effects"). The DPJ compiler uses this region and effect information to guarantee determinism.
This article does not quantify software complexity; however, our software philosophy and DPJ are motivated entirely by the goal of reducing software complexity. Even today, the C++ [Boehm and Adve 2008] and Java [Manson et al. 2005] memory models do not provide any reasonable semantics for data races; therefore, a data race in these programs is a bug and imposes significant verification complexity. In contrast, DPJ provides strong safety guarantees of data-race-freedom and determinism-by-default. Programmers can reason about deterministic programs as if they were sequential. There is certainly an additional up-front burden of writing region and effect annotations in DPJ; however, arguably, this burden is mitigated by the lower debugging and testing time afforded by deterministic-by-default semantics. There is also ongoing work on partly automating the insertion of these annotations [Vakilian et al. 2009]. Overall, there is a growing part of the community that is working toward similar goals of more disciplined programming practices [WoDET 2013].
In this article, we focus on the baseline DeNovo protocol [Choi et al. 2011] targeted to deterministic codes based on the original DPJ language proposed in Bocchino et al. [2009]. DPJ and DeNovo are ongoing projects that continue to incorporate support for increasingly wider classes of programs [Heumann et al. 2013; Sung et al. 2013, 2014]. We discuss this ongoing work and implications for general codes in Section 5.
2.2.2. Hardware. DeNovo exploits disciplined language features for a more efficient, scalable, and simpler hardware. Specifically, structured parallel control and explicit read and write effect information enable caches to self-invalidate stale data (communicated through a compiler-inserted self-invalidate instruction for each region with write effects) and thus remove the need to maintain sharer lists and to send invalidation and acknowledgment messages. In addition, the data-race-freedom guarantee makes it possible to completely eliminate transient states from the cache coherence protocol and thus greatly reduce its complexity. As shown in Choi et al. [2011], these benefits do not compromise the performance of the DeNovo protocol; instead, they provide opportunities for improved performance and energy without increasing complexity.
The DeNovo protocol has three states at L1: Registered (locally modified), Valid, and Invalid. L2 has the same states, with the obvious meanings. The handling of read requests is straightforward. On a write, if the line is not registered, L1 issues a registration request to L2 and transitions to Registered state immediately because there are no races in the program. Similar to MESI, DeNovo also has a dirty bit at L2, which is set to indicate dirty data. The complete state transition table and a detailed description of the DeNovo protocol can be found in Choi et al. [2011].
Let us revisit the code segment from Figure 2. Figure 3(a) shows the changes to the code required to prove data-race-freedom. Specifically, the shared variable A is placed in a region R A, both the parallel phases are annotated with read and write effect summaries, and, finally, a self-invalidation instruction is inserted at the end of the second phase. On the write to A in the second phase, L1 P1 transitions directly to the Registered state without transitioning to any transient state and sends a registration request to L2. L2, on receiving the registration request, transitions to the Registered state. We do not show the registration response message from L2 here because it is not in the critical path and is handled by the request buffer at L1. As explained in Choi et al. [2011], the functionality of the request buffer is similar to that of a uniprocessor system and is invisible to the outside world. At the end of the phase, each core executes a self-invalidate instruction on region R A. This instruction triggers the invalidation of all the data in region R A in the L1 cache of its core (except for data in Registered state since this data is known to be up-to-date).
As an optimization, the L1 maintains an additional touched bit for each word. This bit is set on a read, indicating that this data will stay up-to-date until the end of the phase. This is guaranteed by data-race-freedom because there cannot be concurrent writers to the same location. The self-invalidation operation now skips invalidating the data that is both Valid and touched. These touched bits are reset at the end of the parallel phase.
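The self-invalidation rule with the touched-bit optimization can be sketched as follows. This is a simplified illustration under stated assumptions rather than the DeNovo implementation: the `Word` record, the function names, and the choice to model a read miss as an immediate transition to Valid are all hypothetical, and message exchange with L2 is omitted.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    INVALID = auto()
    VALID = auto()
    REGISTERED = auto()


@dataclass
class Word:
    region: str
    state: State = State.INVALID
    touched: bool = False


def read(word):
    # Assumed simplification: a miss fetches the data and leaves the word Valid.
    if word.state is State.INVALID:
        word.state = State.VALID
    word.touched = True   # data stays up-to-date until the end of the phase


def write(word):
    # No transient state: the word becomes Registered immediately.
    word.state = State.REGISTERED


def self_invalidate(cache, region):
    # Invalidate the region's words, except Registered or Valid-and-touched ones.
    for word in cache:
        if word.region != region or word.state is State.REGISTERED:
            continue
        if word.state is State.VALID and word.touched:
            continue
        word.state = State.INVALID


def end_of_phase(cache, written_regions):
    # Self-invalidate each region with write effects, then clear touched bits.
    for region in written_regions:
        self_invalidate(cache, region)
    for word in cache:
        word.touched = False
```

For example, in a cache holding two Valid words of region "RA", reading only the first one before `end_of_phase(cache, ["RA"])` leaves the first word Valid and invalidates the second.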
This example illustrates how the absence of transient states makes the DeNovo protocol simpler than MESI.
MODELING FOR PROTOCOL VERIFICATION
We use the Murϕ model checking tool [Dill et al. 1992; Dill 1993, 1996] to verify the simple word-based protocols (equal address, communication and coherence granularity as explained in Choi et al. [2011]) of DeNovo and MESI. We derived the MESI model from the GEMS implementation. We derived the DeNovo model from our implementation. To keep the number of states explored (by Murϕ) tractable, as is common practice, we used a single address, single region (only for DeNovo), two data values, and two cores. We modeled private L1 caches, a unified L2, an in-cache directory (for MESI), and an unordered full network with separate request and reply links. Both models allow only one request per L1 in the rest of the memory hierarchy. Because we modeled only one address, we modeled replacements as unconditional events that can be triggered at any time. To enable interactions across multiple parallel phases (cross-phase) in both models, we introduced the notion of a phase boundary by modeling it as a sense-reversing barrier. Finally, we modeled the data-race-free guarantee for DeNovo by limiting conflicting accesses. We explain each of these attributes in detail in the following sections.
Abstract Model
To reduce the amount of time and memory used in verification, we modeled the processors, addresses, data values, and regions as scalarsets [Ip and Dill 1996], a datatype in Murϕ that takes advantage of the symmetry in these entities while exploring the reachable states. A processor is modeled as an array of cache entries consisting of L1 state information along with protocol-specific fields like the region field and the touched bit for DeNovo. L1 state is one of three possible states for DeNovo or one of 11 possible states for MESI. Similarly, L2 is also modeled as an array of cache entries, each with L2 state information, a dirty bit, and other protocol-specific details like sharer lists for MESI. L2 state is one of three possible states for DeNovo or one of 18 possible states for MESI. Memory is modeled as an array of addresses storing data values.
3.1.1. Data-Race-Free Guarantee for DeNovo. To model the data-race-free guarantee from software for DeNovo, we used an additional data structure called AccessStatus. As shown in Figure 4, this data structure maintains the current status (read, readshared, or written) and the core id of the last requestor for every address in the model. The current status and the last requestor determine the reads and writes that cannot occur in a data-race-free program and are thus disallowed in the model.
On any read, if it is the first access to this address in this phase, then status is set to read. If status is already set to read and the requesting core is not the same as the last requestor, then status is set to readshared. If status is readshared, then it stays the same on the read. If status is written and the requesting core is the same as the last requestor, it stays as written. On the other hand, if the requesting core is not the same as the last requestor, then this access is not generated in the model since it violates the data-race-freedom guarantee.
Similarly, on any write, if it is the first access to this address or if the requesting core is the same as the last requestor, then status is set to written. If status is readshared, or if the requesting core is not the same as the last requestor, then this access is not generated, in order to adhere to the data-race-free guarantee.
The AccessStatus data structure is reset for all the addresses at the end of a phase.
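The read and write rules above can be sketched for a single address as follows. This is a Python illustration, not the Murϕ code of Figure 4: the class layout, the method names `try_read` and `try_write`, the extra `NONE` status for "no access yet in this phase", and the choice to update the last requestor on every permitted access are assumptions of this sketch.

```python
from enum import Enum, auto


class Status(Enum):
    NONE = auto()         # assumed marker: no access yet in this phase
    READ = auto()
    READSHARED = auto()
    WRITTEN = auto()


class AccessStatus:
    """Per-address record used to suppress accesses that would form a data race."""

    def __init__(self):
        self.status = Status.NONE
        self.last = None   # core id of the last requestor

    def try_read(self, core):
        # A read by another core after a write would be a race: do not generate it.
        if self.status is Status.WRITTEN and core != self.last:
            return False
        if self.status is Status.NONE:
            self.status = Status.READ
        elif self.status is Status.READ and core != self.last:
            self.status = Status.READSHARED
        # READSHARED stays READSHARED; WRITTEN by the same core stays WRITTEN.
        self.last = core
        return True

    def try_write(self, core):
        first_access = self.status is Status.NONE
        same_core = core == self.last and self.status is not Status.READSHARED
        if first_access or same_core:
            self.status = Status.WRITTEN
            self.last = core
            return True
        return False   # readshared, or a different core: would be a race

    def reset(self):
        # Called for every address at the end of a phase.
        self.status = Status.NONE
        self.last = None
```

In a model checker, the return value would act as the guard of the rule that issues the access, so racy reads and writes are never explored.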
3.1.2. Cross-phase Interactions. We modeled the end of a parallel phase (and the start of the next phase) using a sense-reversing barrier implementation [Mellor-Crummey and Scott 1991]. This event (end-of-phase) can be triggered at any time; that is, with no condition. The occurrence of end-of-phase is captured by a flag, releaseflag. This event occurs per core and stalls the core from issuing any more memory requests until (1) all the pending requests of this core are completed (i.e., the L1 request buffer is empty), and (2) all other cores reach the barrier. The completion of end-of-phase is indicated by resetting releaseflag. Figure 5 shows the Murϕ code for the end-of-phase implementation for the DeNovo protocol. The spinwaiting flag indicates that the current core is waiting for other cores to reach the barrier. When a core enters the barrier for the first time, the local sense of the barrier (localsense) is reversed, thus indicating entering a new barrier; barrier count (barcount) is updated; and the spinwaiting flag is set. If it is the last one to enter the barrier, the core also notifies all other cores about the end of barrier by assigning barrier its localsense. It also resets the barcount and releaseflag. Once a core reaches the barrier, we also modeled self-invalidations and unsetting touched bits for DeNovo. The code for MESI is similar except for DeNovo-specific operations like self-invalidation and unsetting touched bits.
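The barrier logic described above can be sketched as a step function in the style of a guarded rule. This is not the Murϕ code of Figure 5: the classes, the `step` function, the field `pending_requests`, and the decision that each core clears its own release flag (written `release_flag` here) when it leaves the barrier are assumptions of this sketch. DeNovo-specific self-invalidation and touched-bit resets are omitted.

```python
class Core:
    def __init__(self):
        self.localsense = False
        self.spinwaiting = False
        self.release_flag = False    # set when end-of-phase has been triggered
        self.pending_requests = 0    # assumed stand-in for the L1 request buffer


class BarrierModel:
    def __init__(self, num_cores):
        self.cores = [Core() for _ in range(num_cores)]
        self.barrier = False         # shared sense
        self.barcount = 0

    def trigger_end_of_phase(self, i):
        # Unconditional event: core i decides to end its phase.
        self.cores[i].release_flag = True

    def step(self, i):
        """Advance core i's barrier; return True once core i has left it."""
        core = self.cores[i]
        if not core.release_flag or core.pending_requests > 0:
            return False
        if not core.spinwaiting:
            # First entry: reverse the local sense and count this core in.
            core.localsense = not core.localsense
            self.barcount += 1
            core.spinwaiting = True
            if self.barcount == len(self.cores):
                # Last core to arrive releases everyone.
                self.barrier = core.localsense
                self.barcount = 0
        if self.barrier == core.localsense:
            core.spinwaiting = False
            core.release_flag = False
            return True
        return False
```

Because each core flips its local sense on entry, the shared `barrier` value left over from the previous phase never matches a newly arrived core, so the same structure can be reused for every phase without reinitialization.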
Invariants
This section discusses the invariants we checked to verify the MESI and DeNovo protocols. The MESI invariants are based on prior work in verification of cache coherence protocols [Dill et al. 1992; McMillan and J. 1991]. The DeNovo invariants are analogous, as described later. (Adding more invariants does not affect the verification time appreciably because the number of system states explored is still the same.)
3.2.1. MESI Invariants. We used five invariants to verify the MESI protocol [Dill et al. 1992; ...]. [...] and dirty bit is not set, L2's data value should be the same as at memory.
3.2.2. DeNovo Invariants. We modeled six invariants for the DeNovo protocol. Because there is no sharer list maintained in the DeNovo protocol, we do not check for the first two invariants of the MESI protocol. The first three invariants of the DeNovo protocol are similar to the last three invariants of the MESI protocol. The last three invariants of the DeNovo protocol are checks on the touched bit functionality.
RESULTS
Through model checking, we found six bugs (including two deadlock scenarios) in MESI and three bugs in DeNovo.⁵ A deadlock occurs when all the entities in the system (all L1s and L2) stop making any forward progress. Murϕ checks for deadlock by default. In MESI, all the bugs except one of the deadlocks were caused by protocol races between L1 writebacks and other cache events. The other deadlock was caused by incorrect handling of clean replacement at the L2.
Most of these bugs found in MESI involved subtle data races and took several days to track, debug, and fix. All three bugs found in DeNovo were simple to fix and showed mistakes in translating our internal high level specification into the implementation (i.e., their solutions were already present in our internal high level description of the protocol).
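To make the invariant and deadlock checks concrete, the following is a generic sketch of breadth-first explicit-state exploration. It is not how Murϕ is implemented (Murϕ additionally applies techniques such as symmetry reduction over scalarsets); the function `explore` and its representation of rules as (guard, action) pairs are assumptions made for illustration.

```python
from collections import deque


def explore(initial, rules, invariants):
    """Breadth-first search over reachable states.

    rules: list of (guard, action) pairs; action returns the successor state.
    invariants: list of (name, predicate) pairs checked in every state.
    States must be hashable. Returns (states discovered, error or None).
    """
    seen = {initial}
    frontier = deque([initial])
    while frontier:
        state = frontier.popleft()
        for name, holds in invariants:
            if not holds(state):
                return len(seen), f"invariant violated: {name}"
        enabled = [action for guard, action in rules if guard(state)]
        if not enabled:
            # No rule can fire: nothing in the system can make progress.
            return len(seen), "deadlock"
        for action in enabled:
            nxt = action(state)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen), None
```

As a toy usage example, a counter that increments modulo 3 can always take a step, so `explore(0, [(lambda s: True, lambda s: (s + 1) % 3)], [])` visits three states without error, whereas a counter that may only increment while below 2 gets stuck in state 2 and is reported as a deadlock.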
Each of the bugs found in MESI and DeNovo is described in detail next. In all these descriptions, we consider a single address. L1 P1, L1 P2, and L2 indicate the cache lines corresponding to this address in core P1, core P2, and L2, respectively. As mentioned in Section 2, we assume an in-cache directory at L2, and hence we use the words directory and L2 interchangeably.
MESI Bugs
We first discuss the six bugs found in the MESI protocol. We list them in decreasing order of complexity and the amount of change to the code required to fix them.⁶
Bug 1. The first bug is caused by a race between an L1 writeback and a write request by some other L1. Figure 6 shows the events that lead to this bug. Let us assume that initially L1 P1 is in Modified state, L1 P2 is in Invalid state, and L2 records that the cache entry is modified in L1 P1. Then L1 P1 issues a replacement (event 1 in Figure 6) triggering a writeback (PUTX) and transitions to a transient state waiting for an acknowledgment to this writeback request. Meanwhile, L1 P2 issues a write request (event 2) triggering GETX to L2. L2 first receives GETX from L1 P2 (event 3). It forwards the request to L1 P1 and waits for an acknowledgment from L1 P2. L1 P1, on receiving the GETX request (event 4), forwards the data to L1 P2 and transitions to Invalid state. Then L1 P2, on receiving the data from L1 P1 (event 5), transitions to Modified state and unblocks the directory, which in turn records that the cache entry is now modified in L1 P2. But the writeback (PUTX) sent by L1 P1 is still in the network and it can reach the directory at any time because we have an unordered network (event 7), causing an error. For example, suppose L1 P1 later services a write request invalidating L1 P2 and the directory is appropriately updated (not shown in the figure). L1 P1's writeback (PUTX) then reaches the directory, which is clearly an error. The bug was found when the writeback acknowledgment from L2 reached L1 P1, triggering a "missing transition" failure (L1 P1 does not expect a writeback acknowledgment in Modified state).
We solved this problem by not transitioning L1 P1 to Invalid state on receiving L1 P2's GETX request. L1 P1 now sends DATA to L1 P2 as before, but continues to stay in the transient state, M_I. The write request from L1 P1, which triggered the bug in the previous example, is now kept pending because L1 P1 is in a transient state. We also added a transition at the L2 to send a writeback acknowledgment when the requester is not the owner in the directory's record. L1 P1 transitions to Invalid state on receiving the writeback acknowledgment from L2. With this, there is no longer a dangling PUTX in the network and the problem is solved. The trace for this bug involved multiple writes to the same memory location in a parallel phase. This scenario does not arise in DeNovo because the software guarantees data-race-freedom.
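The effect of the Bug 1 fix on the L1 side can be illustrated with a minimal sketch. It assumes a heavily simplified L1 controller for a single address; the transition table, the event names (`Replacement`, `Fwd_GETX`, `WB_Ack`, `Store`, `Data`), the transient state `IM`, and the helper `run` are hypothetical and are not taken from the verified implementation. Only the handling of a forwarded GETX in M_I differs between the two versions.

```python
# Hypothetical, heavily simplified L1 controller for one address.
# State and event names are illustrative, not those of the verified protocol.

class MissingTransition(Exception):
    """Raised when an event arrives in a state with no transition defined."""


def make_l1_table(fixed):
    return {
        ("M", "Replacement"): "M_I",   # send PUTX, wait for writeback ack
        ("M_I", "WB_Ack"): "I",
        ("I", "Store"): "IM",          # send GETX, wait for data
        ("IM", "Data"): "M",
        # Data is sent to the requester in both versions.
        # Original: drop to I and forget the in-flight PUTX.
        # Fixed: stay in M_I until the writeback ack arrives.
        ("M_I", "Fwd_GETX"): "M_I" if fixed else "I",
    }


def run(events, fixed):
    """Apply events in order; return (final_state, stalled_requests)."""
    table, state, stalled = make_l1_table(fixed), "M", []
    for ev in events:
        if ev == "Store" and state == "M_I":
            stalled.append(ev)         # processor requests wait in M_I
            continue
        if (state, ev) not in table:
            raise MissingTransition(f"{ev} in state {state}")
        state = table[(state, ev)]
    return state, stalled
```

Under these assumptions, `run(["Replacement", "Fwd_GETX", "Store", "WB_Ack"], fixed=True)` ends in Invalid with the store still pending, whereas the original trace `["Replacement", "Fwd_GETX", "Store", "Data", "WB_Ack"]` with `fixed=False` raises `MissingTransition` because the acknowledgment arrives in Modified state.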
Bug 2. The second bug is similar to the first except that it is caused by a race between an L1 writeback and a read request by some other L1.
The first two bugs were the most complex to understand and fix. Most of the time was spent in discovering the root cause of the bugs and developing a solution in an already complex protocol. The solutions to these bugs required adding two new cache events and eight new transitions to the protocol.
Bug 3. The third bug is caused by an unhandled protocol race between L2 and L1 replacements. To begin with, L1 P1 is in Exclusive state and L2 records that P1 is the exclusive owner. Then, both L2 and L1 replace the lines simultaneously, triggering invalidation and writeback messages, respectively. L1 P1 , on receiving the invalidation message, transitions to Invalid state and sends its data to L2. On receiving this data, L2 completes the rest of the steps for the replacement. In the end, both L1 and L2 have transitioned to Invalid states, but the initial writeback message from L1 is still in the network and this is incorrect. The bug was found when the writeback acknowledgment (issued by L2 on receiving the dangling writeback message) reaches L1 P1 when it is not expecting one and hence triggers a "missing transition" error.
This bug can be fixed by not sending the data when L1 receives an invalidation message and by treating the invalidation message itself as the acknowledgment for L1's earlier writeback message. Also, the L1 writeback message is treated as the data response for the invalidation message at L2. The fix required adding four new transitions to the protocol.
Bug 4. The fourth bug results in a deadlock situation. It is caused by an incorrectly handled protocol race between an Exclusive unblock (response sent to unblock L2 on receiving an exclusive access) and an L1 writeback issued by the same L1 (issued after sending Exclusive unblock). Initially, L2 is waiting for an Exclusive unblock in a transient state transitioned from Invalid state. In this transient state, when L2 receives an L1 writeback, it checks whether this writeback came from the current owner or not. The owner information is updated at L2 on receiving the Exclusive unblock message. Here, L1 writeback (racing with Exclusive unblock from the same L1) reached L2 first and L2 incorrectly discarded the L1 writeback because the owner information at L2 did not match the sender of the L1 writeback. This incorrect discarding of the L1 writeback results in a deadlock.
This bug can be fixed by holding the L1 writeback to be serviced until Exclusive unblock is received by L2. This requires adding a new transition and additional buffering to hold the stalled request to the protocol.
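The stalling behavior described for Bug 4 might look roughly like the following sketch. The class `L2Entry`, its state names (`WAIT_UNBLOCK`, `OWNED`, `NOT_PRESENT_IN_L1`), and the message names are assumptions made for illustration; the sketch only shows the idea of buffering a writeback until the owner is known, not the actual protocol code.

```python
from collections import deque


class L2Entry:
    """Hypothetical L2/directory entry for one address (illustrative names)."""

    def __init__(self):
        self.state = "WAIT_UNBLOCK"   # transient: waiting for Exclusive unblock
        self.owner = None             # set only when the unblock arrives
        self.stalled = deque()        # extra buffering added by the fix
        self.sent = []                # messages L2 sends back, for inspection

    def receive(self, msg, sender):
        if self.state == "WAIT_UNBLOCK" and msg == "PUTX":
            # Original behavior: the owner is still unknown, so the writeback
            # was discarded. Fixed behavior: hold it until the unblock arrives.
            self.stalled.append((msg, sender))
        elif self.state == "WAIT_UNBLOCK" and msg == "Exclusive_Unblock":
            self.state, self.owner = "OWNED", sender
            while self.stalled:       # replay requests that were held back
                self.receive(*self.stalled.popleft())
        elif self.state == "OWNED" and msg == "PUTX" and sender == self.owner:
            self.state, self.owner = "NOT_PRESENT_IN_L1", None
            self.sent.append(("WB_Ack", sender))
        else:
            raise LookupError(f"missing transition: {msg} in {self.state}")
```

If the PUTX from P1 arrives before P1's Exclusive unblock, the entry buffers it, records P1 as owner once the unblock arrives, and only then services the writeback and acknowledges it.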
Bug 5. The fifth bug is similar to the fourth (race between Exclusive unblock and L1 writeback), but instead L2 is initially in Shared state. The fix for this bug required adding two new transitions and additional buffering to hold the stalled requests to the protocol.
Bug 6. The last bug results in a deadlock scenario due to an incorrect transition by L2 on a clean replacement. It transitions to a transient state awaiting an acknowledgment from memory even though the transition did not trigger any writeback. The fix was simple and required transitioning to Invalid state instead.
DeNovo Bugs
We next discuss the three bugs found in the DeNovo protocol. The first bug is a performance bug and the last two are correctness bugs, both of which are caused by races related to writebacks.
Bug 1. The first of the three bugs found was caused by not unsetting the dirty bit on replacement of a dirty L2 cache line. Assume that L2 is initially in Valid state and the dirty bit is set to true. Then, on L2 replacement, it transitions to Invalid state and writes back data to memory. But the dirty bit is mistakenly not unset. This bug was found when Murϕ tried to replace the line in Invalid state as the dirty bit was set to true (the model triggers a replacement event by only checking the dirty bit). The model, legitimately, did not have an action specified for a replacement event in the Invalid state, thus resulting in a "missing transition" error. However, the actual implementation did have an action (incorrectly) that triggered unnecessary writebacks to memory, which should be silent replacements instead. This turned out to be a rare case to hit in the simulation runs.
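The way the stale dirty bit surfaces in the model can be sketched as follows. The class `DeNovoL2Line`, the flag `clear_dirty_on_replace`, and the driver `drive` are hypothetical; the sketch assumes, as the text describes, that the model enables a replacement event whenever the dirty bit is set and has no replacement action for the Invalid state.

```python
class DeNovoL2Line:
    """Hypothetical DeNovo L2 line for one address (illustrative only)."""

    def __init__(self, clear_dirty_on_replace):
        self.state, self.dirty = "Valid", True
        self.clear_dirty = clear_dirty_on_replace
        self.mem_writebacks = 0

    def replacement_enabled(self):
        # As in the model described above: only the dirty bit is checked.
        return self.dirty

    def replace(self):
        if self.state == "Invalid":
            raise LookupError("missing transition: replacement in Invalid")
        self.mem_writebacks += 1       # write the dirty data back to memory
        self.state = "Invalid"
        if self.clear_dirty:
            self.dirty = False         # the fix: unset the dirty bit


def drive(line, max_steps=3):
    """Fire replacement events while the model considers them enabled."""
    for _ in range(max_steps):
        if not line.replacement_enabled():
            break
        line.replace()
    return line.mem_writebacks
```

With the dirty bit cleared, `drive` performs exactly one writeback and stops. Without it, the second replacement attempt hits the Invalid state and raises, which mirrors the "missing transition" error reported by the model checker.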
Bug 2. This occurs because an L2-initiated writeback and future requests to the same cache line are not serialized. Initially, L1 P1 is in Registered state and L2 knows P1 as the registrant. On replacing the line, L2 sends a writeback request to L1. L1 replies to this writeback request by sending the data to L2 and transitions to Valid state. Then, on receiving the writeback from L1, L2 sends an acknowledgment to L1 and, in parallel, sends a writeback to memory and waits for an acknowledgment. Meanwhile, let us assume that L1 issued a registration request (on receiving a store request) and successfully registers itself with L2. At this point, yet another L2 replacement was triggered, finally leading to multiple writebacks to memory in flight. This is incorrect because the writebacks can be serviced out of order. Murϕ found this bug when an assertion failed inside the implementation of L2's request buffer.
The real source of this bug is allowing L1 registration to be serviced at L2 while a writeback to memory is pending. The fix involves serializing requests to the same location at L2 (in this case, the L1 registration request is serialized behind the writeback to memory). This was already present in our high-level specification but was missed in the actual protocol implementation. It did not involve adding any new states or transitions to the protocol.
Bug 3. The last bug is due to a protocol race where both the L1s and the L2 replace the line. This bug involves both cores and cross-phase interactions. The events that lead to the bug are shown in Figure 7. At the beginning of the phase, let us assume that L1 P1 is in Invalid state and L1 P2 is in Registered state (from the previous phase). L1 P2 replaces the line (event 1 in Figure 7) and issues a writeback (PUTX) to L2. While this writeback is in flight, L1 P1 successfully registers itself with L2 (events 2-4) (L2 redirects the request to L1 P2 because it is the current registrant). This is followed by a replacement by L1 P1 (event 5), thus triggering another writeback (PUTX) to L2. L2 first receives the writeback from L1 P1 (event 6) and responds by sending an acknowledgment and transitioning to Valid state while setting the dirty bit to true. Now, L2 also replaces the line (event 7), transitioning to Invalid state, and issues a writeback to memory. But the writeback from L1 P2 is still in flight. This writeback now reaches L2 (event 8) while in Invalid state (because we model an unordered network). The implementation did not handle this case and resulted in a "missing transition" failure. This bug can be easily fixed by adding a transition to send an acknowledgment to L1 P2's writeback without the need for triggering any actions at L2.
Analysis
The bugs just described for both MESI and DeNovo show that cache line replacements and writebacks, when interacting with other cache events, cause subtle races and add to the complexity of cache coherence protocols. Fixes to bugs in the MESI protocol needed adding new events and several new transitions. On the other hand, fixing bugs in the DeNovo protocol was relatively easy since it lacks transient states even for races related to writebacks. 
Verification Time
After fixing all the bugs, we ran the models for both MESI and DeNovo on Murϕ as described in Section 3. The model for MESI explores 1,257,500 states in 173 seconds, whereas the model for DeNovo explores 85,012 states in 8.66 seconds. These are the numbers of distinct system states exhaustively explored by the model checking tool. The state space and runtime both grow significantly when we increase the parameters in the verification model. For example, when we modeled two addresses, we were able to finish running DeNovo without any bugs being reported, but we ran out of system memory (32GB) for MESI. This indicates (1) the simplicity and reduced verification overhead for DeNovo compared to MESI, and (2) the need for more scalable tools amenable to non-experts to deal with more conventional hardware coherence protocols in a more comprehensive way.
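The roughly 15X and 20X gaps between MESI and DeNovo follow directly from the state counts and runtimes reported above. As a quick check of the arithmetic (the variable names are ours):

```python
# Figures reported above for the two-core, one-address models.
mesi_states, mesi_seconds = 1_257_500, 173.0
denovo_states, denovo_seconds = 85_012, 8.66

state_ratio = mesi_states / denovo_states    # about 14.8, i.e. roughly 15X
time_ratio = mesi_seconds / denovo_seconds   # about 20.0, i.e. roughly 20X

print(f"states: {state_ratio:.1f}X, time: {time_ratio:.1f}X")
```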
RELATED WORK
Hardware cache coherence protocols are inherently complex. There is an ongoing debate about whether we have tamed the complexity of hardware coherence protocols or whether we should abandon them and replace them with their software counterparts. While some predict that hardware cache coherence is here to stay [Martin et al. 2012] and continue to extend such protocols in newer, more complex ways [Aisopos and Peh 2011; Zhao et al. 2013] , others propose and build systems with software-directed coherence [Borkar 2011; Choi et al. 2011; Howard et al. 2010; IntelSCC 2009; Kelm et al. 2009] . Our work offers an objective case study to show that hardware coherence remains quite complex and that a hardware-software co-design approach can provide a simpler alternative. This section first discusses several existing techniques to verify hardware coherence protocols (Section 5.1). Next, we discuss related work on redesigning conventional hardware coherence protocols specifically to address verification complexity (Section 5.2).
Verification Techniques
Hardware coherence protocols have numerous transient states and hard-to-cover race conditions making it very difficult to find all the bugs using just simulations or random testing. Hence, formal methods like model checking are often employed to verify their correctness. Model checking is a technique to verify the properties of a system by exhaustive exploration of the state space [Clarke and Emerson 1982; Queille and Sifakis 1982] . McMillan and Schwalbe's seminal work on model checking the Encore Gigamax protocol [McMillan and Schwalbe 1991] was the first to apply model checking to verify cache coherence protocols.
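The core loop of an explicit-state model checker can be sketched in a few lines. The following is a minimal illustration, not Murϕ itself: `explore` performs a breadth-first search over all reachable states and checks an invariant on each one. The toy two-cache system (`successors`) and the single-writer check (`swmr`) are assumptions made for the example; the toy transitions are atomic, so it has none of the transient states and races that make real protocols hard.

```python
from collections import deque


def explore(initial, successors, invariant):
    """Exhaustive BFS; returns (states_seen, first_violating_state_or_None)."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        if not invariant(state):
            return len(seen), state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen), None


def successors(state):
    """Toy system: two caches, one line each, in state I, S, or M."""
    for i in (0, 1):
        j = 1 - i
        read = list(state)
        if state[i] == "I":
            read[i] = "S"              # read miss: obtain a shared copy
            if state[j] == "M":
                read[j] = "S"          # a modified peer is downgraded
        yield tuple(read)
        write = list(state)
        write[i], write[j] = "M", "I"  # write: invalidate the peer
        yield tuple(write)
        evict = list(state)
        evict[i] = "I"                 # replacement
        yield tuple(evict)


def swmr(state):
    """Single writer: a Modified copy implies the other cache is Invalid."""
    return "M" not in state or sorted(state) == ["I", "M"]


print(explore(("I", "I"), successors, swmr))   # (6, None)
```

The `seen` set is what grows with the number of cores and addresses; it is the source of the state space explosion discussed for explicit-state tools.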
Complex systems often exhibit a lot of regularity and symmetry. Ip and Dill developed Murϕ [Dill et al. 1992; Dill 1993, 1996] , which exploits these characteristics by grouping together similar states to verify a reduced state graph instead of the full one. This helps to greatly reduce the amount of time and memory used in verification. Murϕ is a widely used tool to formally verify cache coherence protocols; for example, Sun RMO memory model [Park and Dill 1995] , Sun S3.mp multiprocessor [Pong et al. 1998 ], Cray SV2 protocol [Abts et al. 2000] , and Token coherence protocol [Burckhardt et al. 2005] .
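The idea of grouping symmetric states can be illustrated with a toy example. The sketch assumes that cores are interchangeable and that a global state is just a tuple of per-core line states; `canonical` is a hypothetical helper that picks one representative per equivalence class. It is not how Murϕ implements symmetry reduction, only a small instance of the principle.

```python
from itertools import product


def canonical(state):
    """One representative per class of states equal up to core permutation."""
    return tuple(sorted(state))


full = set(product("ISM", repeat=3))        # 3 cores, each line in I, S, or M
reduced = {canonical(s) for s in full}

print(len(full), len(reduced))              # 27 10
```

Even in this tiny example the checker would store 10 representatives instead of 27 states, and the saving grows quickly with the number of symmetric components.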
In this article, we use Murϕ for our protocol verification work. Using model checking tools like Murϕ for verifying cache coherence protocols is not new per se. This article, nevertheless, does discuss some extensions to the canonical coherence protocol modeling technique to model the guarantees provided by disciplined programming languages (namely, data-race-freedom and a disciplined parallel phase behavior). The main contribution of this work is to provide a detailed experience of the verification process of a state-of-the-art, publicly available, mature, modern hardware coherence protocol implementation (MESI). We further compare this experience with that of verifying a protocol driven by a hardware-software co-design approach (DeNovo), motivated by simplifying both software and hardware through the same mechanisms (while providing opportunities for performance and energy improvements).
Explicit state exploration model checking tools traditionally have the problem of state space explosion limiting the scalability of such tools. As mentioned in Section 4.4, we ran out of system memory when we increased the verification parameters. We tried a distributed model checker based on Murϕ, PReach [Bingham et al. 2010] (similar to Eddy Murphi [Melatti et al. 2009]). But PReach did not help us reduce the number of states explored, and hence the memory footprint stayed the same.
There are other verification techniques that do not have this state explosion problem and can scale to larger systems. Parametric verification [Chou et al. 2004; McMillan 2001; O'Leary et al. 2009 ] and theorem proving [Seungjoon and David 1996] are two such techniques. Several of these techniques are sometimes combined together to verify a given system. For example, the verification of Token Coherence [Burckhardt et al. 2005 ] is achieved by combining assume-guarantee reasoning and structural induction (in addition to model checking). Even though the techniques employed were able to verify the protocol, in practice, these techniques are either hard for nonspecialists to use or error-prone because of laborious manual intervention [Zhang et al. 2014] . There have been proposals to automate parametric verification techniques and minimize manual intervention [Chou et al. 2004 ], but such techniques impose severe limitations on the protocols that can be verified. DeNovo, in contrast, is a simpler protocol and makes it feasible to verify with easy-to-use verification techniques, such as an explicit state model checker. A general survey of various techniques used to verify cache coherence protocols can be found in Pong and Dubois [1997] .
Redesigning Hardware Coherence Protocols
There has been some work on redesigning the hardware cache coherence protocol itself to address the verification problem. A recent proposal, PVCoherence [Zhang et al. 2014], lists various guidelines for designing cache coherence protocols so that they can be verified using existing automatic parametric verification techniques. The verification technique used in the paper, Simple-PV, uses an automatic tool to generate the parametric model and then Murϕ to verify the generated model. When the proposed guidelines were applied to the MOESI protocol to make it amenable to parametric verification (in the process also making it even more complex), the authors noticed that the resulting protocol could not be verified by Murϕ and needed even more changes. The final protocol was verifiable but showed performance degradation (e.g., average 5% and up to 13.8% increase in network traffic). In Fractal Coherence [Zhang et al. 2010; Voskuilen and Vijaykumar 2014], the verification for arbitrary number of cores is made possible by just verifying the minimum system for correctness and verifying that the whole system has fractal behavior. But Fractal Coherence requires addition of states and messages to an existing cache coherence protocol (MOSI, as described in Zhang et al. [2010]) to maintain fractal behavior. Thus, it enables verification of the entire system at the expense of increasing the complexity of the base protocol. Also, a specific implementation of Fractal Coherence, TreeFractal, shows a performance degradation (>10%) when compared to traditional protocols. FlatFractal [Voskuilen and Vijaykumar 2014] proposes techniques to eliminate some of the limitations of TreeFractal, making FlatFractal's performance comparable to the base implementation on 16 cores. But when scaled to 32 cores, FlatFractal shows up to 40% degradation in total execution time. Atomic Coherence [Vantrease et al. 2011] attempts to simplify the coherence protocol by separating out the races from the protocol. The design avoids the protocol races by requiring each coherence action to be guarded by a mutex. It uses nanophotonics for performing these mutex operations with low latency because they are now on the critical path. As a result, Atomic Coherence eliminates transient states in the coherence protocol but heavily relies on a specific type of on-chip network.
In contrast to these approaches, DeNovo reduces the complexity of the cache coherence protocol by following a hardware-software co-design approach that eliminates races and thereby reduces the overall verification overhead. The basic DeNovo protocol is shown to be performance-competitive with MESI, and its simplicity enables optimizations that significantly improve performance and network traffic, thereby also potentially saving energy. The DeNovo coherence protocol used in this article assumes deterministic software. DeNovo is an ongoing research project seeking to broaden the class of software supported by such simple protocols. Recently, we extended DeNovo to support disciplined lock-based programs using dynamically generated access signatures to guide self-invalidations [Sung et al. 2013, 2014]. These extensions are rather small, and we believe they can be verified independently. We are also developing simple extensions to support more complex lock-free synchronization algorithms. With these extensions, DeNovo will be able to support all codes correctly as long as synchronization operations are explicit. For legacy codes that do not contain annotations for self-invalidations or generating dynamic access signatures, performance (but not correctness) may be affected. DeNovo's vision of using disciplined parallel programming practices for better performance and programmability is shared by other projects as well, including for complex system code such as operating systems (e.g., Barrelfish [Baumann et al. 2009]). This article provides an objective study to quantify the complexity advantages of such approaches.
CONCLUSION
With the era of manycores, there is a renewed debate in the community as to whether we have tamed the complexity of hardware coherence enough to employ it in systems with hundreds of cores or whether we should abandon it for other software alternatives. Although some have argued that hardware coherence is here to stay [Martin et al. 2012] and continue to work on even more complex extensions [Aisopos and Peh 2011], others are proposing systems that use software-directed coherence [Borkar 2011; Choi et al. 2011; Howard et al. 2010; IntelSCC 2009; Kelm et al. 2009]. This article adds objective results from a concrete case study to this debate. We attempted to formally verify a publicly available, state-of-the-art implementation of the MESI protocol. Surprisingly, we found that after four years of extensive use in the architecture community, the protocol implementation still had several bugs. These bugs were hard to diagnose and fix, requiring new state transitions. In contrast, verifying a far less mature, hardware-software co-designed protocol, DeNovo, revealed fewer bugs that were much easier to fix. After the bug fixes, we found that MESI took 20X longer to model check and required significant simplifying assumptions for the verification to be tractable. We conclude that even with a multitude of prior work, hardware coherence protocols are still difficult to design and understand. Furthermore, although it is difficult to define a single metric to quantify the relative complexity of protocols or to generalize from two design points, our results indicate that hardware-software co-designed protocols may offer a simpler alternative. This article also reinforces the necessity of employing formal verification techniques to prove the correctness of these protocols, but these techniques still require simplifying assumptions.
