As the level of parallelism in manycore processors keeps increasing, providing efficient mechanisms for thread synchronization in concurrent programs is becoming a major concern. On cache-coherent shared-memory processors, synchronization efficiency is ultimately limited by the performance of the underlying cache coherence protocol. This article studies how hardware support for message passing can improve synchronization performance. Considering the ubiquitous problem of mutual exclusion, we devise novel algorithms for (i) classic locking, where application threads obtain exclusive access to a shared resource prior to executing their critical sections (CSes), and (ii) delegation, where CSes are executed by special threads. For classic locking, our HYBLOCK algorithm uses a mix of shared memory and hardware message passing, which introduces the idea of hybrid synchronization algorithms. For delegation, we propose MP-SERVER and HYBCOMB: the former is a straightforward adaptation of the server approach to hardware message passing, whereas the latter is a novel hybrid combining algorithm. Evaluation on Tilera's TILE-Gx processor shows that HYBLOCK outperforms the best known classic locks. Furthermore, MP-SERVER can execute contended CSes with unprecedented throughput, as stalls related to cache coherence are removed from the critical path. HYBCOMB can achieve comparable performance while avoiding the need to dedicate server cores. Consequently, our queue and stack implementations, based on the new synchronization algorithms, largely outperform their most efficient shared-memory-only counterparts.
INTRODUCTION
As industry is shifting toward manycore processors, it is increasingly important to put the constantly growing number of cores to good use. For some types of applications, such as scale-out workloads typically found in data centers, this is not a problem because of their embarrassingly parallel nature. There are, however, applications whose parallelization requires significant effort, as they contain data or objects intensively shared by multiple threads. To ensure consistency, threads must access such shared parts of the program state in a synchronized fashion. Whether synchronization is implemented using critical sections (CSes) or nonblocking (lock-free) algorithms, it creates
Authors' addresses: D. Petrović, EPFL Innovation Park, Lausanne, Switzerland; email: darko.petrovic@alumni.epfl.ch; T. Ropars, Laboratoire d'informatique de Grenoble, France; email: thomas.ropars@imag.fr; A. Schiper, Ecole Polytechnique Fédérale de Lausanne (EPFL), Switzerland; email: andre.schiper@epfl.ch.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
Coherent shared memory and message passing coexist in some recent hybrid processors, such as Tilera's TILE-Gx processor family [Tilera 2014]. As such, it provides a large design space for synchronization primitives. It is also an ideal testbed to experimentally compare shared-memory and message-passing approaches. In this work, we consider the problem of executing CSes and use TILE-Gx to study how hardware message passing can make it more efficient than with classic shared-memory techniques.
Our findings indicate that state-of-the-art solutions for efficient CS execution, both lock-based (MCS [Mellor-Crummey and Scott 1991], CLH [Craig 1993; Magnusson et al. 1994]) and delegation-based (RCL [Lozi et al. 2012], CC-SYNCH [Fatourou and Kallimanis 2012]), waste much time on activities related to cache coherence. When CSes are short, this dominates all other overheads, especially in the case of delegation. To overcome this problem, we take advantage of hardware message passing and present three solutions: HYBLOCK, a classic lock; MP-SERVER, a simple server-based approach; and HYBCOMB, a combining algorithm. Whereas adapting the server-based approach used in shared-memory systems to message passing is straightforward, the design of HYBLOCK and HYBCOMB involves more algorithmic complexity. As their names suggest, HYBLOCK and HYBCOMB are hybrid algorithms that rely both on cache-coherent shared memory and hardware message passing for synchronization. In HYBLOCK, the lock state is kept in shared memory, but the lock is handed over between contending threads using message passing. In HYBCOMB, message passing is used to exchange requests and responses between the combiner and other threads, whereas shared memory is used to manage combiner identity (which would be complex and inefficient to do using message passing).
We evaluate the performance of HYBLOCK, MP-SERVER, and HYBCOMB by implementing ubiquitous linearizable [Herlihy and Wing 1990] concurrent objects, namely counters, queues, and stacks. Experiments with counters implemented using classic locks show that HYBLOCK outperforms MCS and CLH by up to 1.35x. With delegation, the difference is more striking: MP-SERVER outperforms CC-SYNCH and RCL by up to 4.3x. This is because at high concurrency levels, virtually no stalls remain on the critical path of the server. HYBCOMB also largely outperforms the shared-memory-only solutions and can achieve performance close to that of MP-SERVER while avoiding the need to dedicate cores. Compared to other queues and stacks, our new implementations on top of MP-SERVER and HYBCOMB reach up to 2x and 1.5x higher throughput, respectively, shedding light on the advantages of hardware message passing for synchronization.
In summary, the contributions of this work are the following. We analyze the performance limitations of state-of-the-art solutions for efficient CS execution over cache-coherent shared memory in Section 3. In Section 4, we describe HYBLOCK, MP-SERVER, and HYBCOMB, our synchronization solutions based on hardware message passing. This includes the full specification of HYBLOCK and HYBCOMB, which are, to the best of our knowledge, the first synchronization algorithms that exploit the hybrid nature of contemporary processors. We present an extensive evaluation of HYBLOCK, MP-SERVER, and HYBCOMB in Section 5. On the example of linearizable counters, queues, and stacks, we show that they perform significantly better than their most efficient known shared-memory counterparts. Additional considerations are provided in Section 6, and related work is discussed in Section 7. We conclude in Section 8.
SYSTEM MODEL
We assume a set of T sequential threads that can communicate both by issuing operations to a coherent shared memory and by directly exchanging messages.
Cache coherence. In the cache-coherent shared-memory model, threads operate on cached copies of shared variables. We assume a model adapted from the one by Sorin et al. [2011]. A processor chip is composed of single-threaded cores. Each core has its local, private data cache. All cores have access to a globally shared memory through an interconnection network. The cache coherence protocol maintains the single-writer-multiple-reader invariant: at any given time, either a single core has read-write access to a cached variable or some cores have read-only access [Sorin et al. 2011]. Remote memory references (RMRs) are accesses to shared variables that involve communication on the interconnection network. In this model and assuming write-back caches, reading a shared variable generates an RMR if the core does not hold a copy of the variable in either mode. Writing a shared variable generates an RMR if the core does not hold a copy of the variable in read-write mode.
Shared-memory operations. The memory is an array of 64-bit locations. Similarly to most related studies, we assume that the memory is sequentially consistent [Lamport 1979]. Supported operations on a memory location a are the standard read(a), write(a, v)
CRITICAL SECTIONS OVER CACHE-COHERENT SHARED MEMORY
This section details existing techniques for the efficient execution of CSes on cache-coherent processors. It explains how their performance is influenced by the underlying cache coherence protocol. In the discussion, we address two common ways to ensure mutual exclusion: classic locks and delegation.
Classic Locks
CSes are usually implemented using locks. In this context, the basic technique to improve scalability is to introduce local spinning [Mellor-Crummey and Scott 1991]: each thread polls on a different variable that stays in its local cache. As a result, the number of RMRs per lock acquisition is constant; that is, it does not depend on the number of threads competing for the lock. This ensures that the performance of the lock does not deteriorate with increased contention, which is a problem faced by simple locks without local spinning, such as test-and-set [Herlihy and Shavit 2008].
In a lock implementation, local spinning is most often ensured by maintaining an explicit or implicit queue of threads that wait to acquire the lock. When the thread t holding the lock releases it, t lets the next thread t' in the queue acquire it by writing to a shared variable on which t' is spinning. Because of this, such locks are often referred to as queue locks. Irrespective of the implementation, a queue lock handover on a cache-coherent system has at least two RMRs on the critical path: one from the thread that releases the lock, to write that the next thread can proceed, and one from that next thread, to read the written information. This is depicted in Figure 1, where thread a is handing over a lock to thread b by setting flag_b. Since b is spinning on flag_b, an RMR is triggered when a tries to write to flag_b, to bring flag_b in read-write mode into the cache of the core where a is running and invalidate the cached copy that b is holding. Subsequently, reads from b do not hit in the local cache anymore, which triggers another RMR to bring a read-only copy of flag_b into the cache of the core executing b. Note that the two RMRs can partially overlap, as the second one can be triggered as soon as flag_b is invalidated in the cache of b. Nevertheless, each of them typically involves several message exchanges between cores, cache controllers, or other implementation-dependent agents, which significantly contributes to the lock handover time.
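The RMR accounting behind this handover can be replayed with a toy model of a single cached variable. The following Python sketch is only an illustration of the single-writer-multiple-reader rules stated in Section 2; the names CoherentVar and handover_rmrs are hypothetical, and no timing, overlap, or real protocol detail is modeled.

```python
class CoherentVar:
    """One shared variable under the single-writer-multiple-reader invariant."""

    def __init__(self, value=0):
        self.value = value
        self.writer = None     # core holding the variable in read-write mode
        self.readers = set()   # cores holding read-only copies
        self.rmrs = 0          # remote memory references counted so far

    def read(self, core):
        # A read hits locally if the core holds a copy in either mode.
        if core != self.writer and core not in self.readers:
            self.rmrs += 1
            if self.writer is not None:
                # The former writer is downgraded to a read-only copy.
                self.readers.add(self.writer)
                self.writer = None
            self.readers.add(core)
        return self.value

    def write(self, core, value):
        # A write hits locally only if the core holds the read-write copy.
        if core != self.writer:
            self.rmrs += 1
            self.readers.clear()   # invalidate all other copies
            self.writer = core
        self.value = value


def handover_rmrs():
    """Replay the flag-based handover from thread a to thread b."""
    flag_b = CoherentVar(0)
    for _ in range(3):
        flag_b.read("b")             # b spins; only the first read misses
    spin_rmrs = flag_b.rmrs
    flag_b.write("a", 1)             # a releases: invalidates b's copy
    assert flag_b.read("b") == 1     # b re-reads: fetches a read-only copy
    return spin_rmrs, flag_b.rmrs - spin_rmrs
```

Under these assumptions, handover_rmrs() reports one RMR for the whole spinning phase and two RMRs for the handover itself, matching the count given in the text.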
Apart from these synchronization RMRs, each CS execution likely implies additional overheads, inherent to locks. Namely, there are RMRs inside the CS itself: the data protected by the lock keep bouncing between caches, as every thread accessing them brings them to its local cache, possibly invalidating the other copies. Additionally, on architectures with weak memory models, every lock handover necessitates expensive memory fences to make sure that the new lock holder has the most recent copy of the data protected by the lock.
Delegation
Delegation is a way to avoid the data bouncing problem inherent to classic locks. The key idea is that instead of moving the data associated with a CS to the core that wants to execute the CS, the CS is executed on the core where the data are located. We can identify two approaches that exploit this idea: the client-server approach [Lozi et al. 2012; Cleary et al. 2013] and the combiner approach [Fatourou and Kallimanis 2011, 2012; Hendler et al. 2010; Oyama et al. 1999; Klaftenegger et al. 2014].
Remote core locking (RCL) [Lozi et al. 2012] is an efficient implementation of the client-server approach. A nonapplication thread (the server) is in charge of executing CSes. Application threads (clients) send requests to the server to execute a CS on their behalf. Assuming that data accessed inside the CSes are never accessed by application threads outside the CSes, these data remain in the cache of the server, ensuring that the number of RMRs during CS execution is minimized. Ideally, the only RMRs that remain on the critical path of the CS execution are the ones related to synchronization between the clients and the server. Figure 2 illustrates the execution of an RCL server. For client-server communication in RCL, each client thread has a dedicated cache line, which it uses as a bidirectional channel. When client i wants to execute a CS, it writes its request to the cache line channel_i and then spins on that cache line until it receives a reply from the server. The server first reads the request from channel_i. Since the last access to channel_i was from client i writing the request, this read triggers an RMR (server stalls are represented in dark grey). Then, the server executes the CS. Finally, it writes to channel_i to inform the client that the request has been processed. This write triggers another RMR to invalidate the client's copy of the cache line. The figure assumes high load (i.e., the server is never idle) and shows that in this case there are two RMRs at the RCL server per CS. Note that Figure 2 is somewhat simplified, as it assumes that the instructions do not overlap at all. On a real processor, the different RMRs might partially overlap, depending on the memory consistency model and other features of the processor at hand, resulting in fewer CPU stalls. Nevertheless, these RMRs remain an important source of overhead even on a processor with weak memory consistency (see Section 5).
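The two server-side RMRs per CS can be checked by replaying the channel protocol on a toy cache-line model that charges each RMR to the core that caused it. This Python sketch is an illustration under the model of Section 2 only; CacheLine and server_rmrs_per_cs are hypothetical names, and the code is not taken from RCL.

```python
from collections import Counter


class CacheLine:
    """A cache line tracked under the single-writer-multiple-reader rule."""

    def __init__(self):
        self.value = None
        self.writer = None       # core holding the line in read-write mode
        self.readers = set()     # cores holding read-only copies
        self.rmrs = Counter()    # RMRs charged to each core

    def read(self, core):
        if core != self.writer and core not in self.readers:
            self.rmrs[core] += 1
            if self.writer is not None:
                self.readers.add(self.writer)   # downgrade the old writer
                self.writer = None
            self.readers.add(core)
        return self.value

    def write(self, core, value):
        if core != self.writer:
            self.rmrs[core] += 1
            self.readers.clear()                # invalidate other copies
            self.writer = core
        self.value = value


def server_rmrs_per_cs(rounds=5):
    """Replay `rounds` requests on one client channel under high load."""
    channel = CacheLine()
    for i in range(rounds):
        channel.write("client", ("request", i))    # client posts its request
        channel.read("client")                     # client spins (local hit)
        channel.read("server")                     # server fetches the request
        channel.write("server", ("response", i))   # server posts the reply
        channel.read("client")                     # client observes the reply
    return channel.rmrs["server"] / rounds
```

In this model the server pays one RMR to read each request and one to write each response, so server_rmrs_per_cs() evaluates to 2.0 regardless of the number of rounds.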
Although keeping similar performance benefits, the combiner approach does not require dedicated servers [Hendler et al. 2010] . When a thread gets a lock associated with a CS, it becomes a combiner and executes operations of other threads that are waiting to access this CS in addition to its own. To prevent the combiner from starving if the number of operations of other threads to execute is high, the combiner role is handed over to another thread when the current combiner has served a predefined number of requests. CC-SYNCH [Fatourou and Kallimanis 2012] is, to our knowledge, the most efficient combiner-based approach. Since the combiner changes over time, the synchronization mechanism is more complex than in RCL. Nevertheless, with a thread acting as a combiner, CC-SYNCH is similar to RCL with respect to RMRs: it generates one RMR to read a request from another thread and then generates another RMR to inform that thread that the operation has been performed.
The server-based approach has the advantage of being simple and very efficient in cases where a small number of clearly identified CSes are highly contended [Lozi et al. 2012] . On the other hand, combining is more flexible, which comes at the expense of requiring more complex synchronization between threads. Indeed, combiners adapt themselves automatically to the load: if a CS is highly contended, all CPU cycles of one core will be temporarily allocated to it, but if no thread tries to execute a CS, no resources are consumed.
Both with RCL and CC-SYNCH, only two RMRs related to thread synchronization remain on the critical path of a CS execution. These two RMRs, however, can have a big impact on throughput if the code to execute in the CS itself contains few or no RMRs.
It should be noted that although delegation is efficient when contention is high, it has some disadvantages compared to classic locks. First, it imposes some usability constraints. In particular, since the thread executing the CS is usually not the one requesting it, access to a thread-specific state (thread-local and stack variables) and performing any thread-specific work within a CS is harder. Second, when it comes to performance, simple classic locks might turn out to be more efficient if contention is low [David et al. 2013]. We will briefly compare the performance of classic locks and delegation in Section 5.3.3.
CRITICAL SECTIONS USING MESSAGE PASSING
We discuss three ways to leverage hardware support for message passing to execute CSes efficiently. We start by presenting a hybrid lock that uses both shared memory and hardware message passing for thread synchronization. Then we move to delegation: in this context, we first explain how hardware messaging can be beneficial by addressing the client-server approach. Second, we present a novel hybrid combining algorithm that takes advantage of the given insights.
The Classic Lock (HYBLOCK)
In Section 3, we saw that even the most efficient classic locks require at least two RMRs during lock handover, since the next thread to take the lock is signaled using a flag in shared memory. If the hardware allows direct exchange of messages, it is natural to think of an algorithm that would use a message to hand over the lock to the next thread: instead of writing to and spinning on a flag, the current lock holder should signal the next one by sending a message, as shown in Figure 3. In this way, the overhead of cache coherence is removed from the lock handover (although one full message latency remains on the critical path). Still, shared memory remains a convenient means to implement synchronization. Indeed, to implement mutual exclusion over message passing only, one can use a distributed mutual exclusion algorithm, although it is typically expensive in terms of number of messages. For example, even the well-known NTA algorithm [Naimi et al. 1996] necessitates at least O(log n) messages over a long run in a system with n processes, whereas O(1) is easily achievable if a shared memory is also present, as we will now demonstrate.
HYBLOCK is designed to take advantage of both message passing and cache-coherent shared memory. When the lock is not contended, threads simply acquire and release it by modifying a shared-memory variable. In the event of contention, the lock is handed over between threads using message passing, avoiding synchronization RMRs and thus improving throughput. More precisely, threads form a logical queue. Upon finishing its CS, every thread directly transfers the lock to the next one in the queue just by sending a message, without modifying the shared state.
The pseudocode of HYBLOCK is given in Algorithm 1. The lock state is represented by an integer. A special value, UNLOCKED, denotes that the lock is free. Otherwise, this variable contains the id of the last thread requesting the lock. Besides, every thread has a private (thread-local) variable next, used in some execution scenarios to store the id of the next thread to which the lock will be handed over.
ALGORITHM 1: HYBLOCK (partial listing)
 9:   if prev ≠ UNLOCKED then
10:     send(prev, id)
11:     tmp ← receive(1)
12:     if tmp ≠ ALLOW then
13:       …
14:       receive(1)
15: lock_release(lock: Lock)
      …
      send(tmp, ALLOW)
To acquire the lock, a thread t first executes a SWAP operation on the lock state, writing its id and returning the previous value (line 8). If that value is UNLOCKED, thread t takes the lock and no message passing is needed. Otherwise, t is queued after prev, which was the previous thread to request the lock (i.e., execute the SWAP at line 8). Thread t then sends a message with its id to prev, thus letting prev know that it should eventually reply with ALLOW to hand over the lock. Thread t then waits at line 11 for that to happen. When ALLOW comes in, t has successfully acquired the lock. Note, however, that the incoming message might not be ALLOW: as soon as t has executed the SWAP at line 8, another thread may request the lock, in which case it will line up directly after t, sending t a message at line 10. In this case, t stores the id of its successor in next, for use when it releases the lock.
When t decides to release the lock, it first checks if next contains a valid thread id, in which case the appropriate thread is contacted and next is reset. If the identity of the next thread is not known, this means that either (i) no thread has lined up to take the lock after t or (ii) its message is yet to be received. Next, if the message queue is empty, there is a chance that (i) is satisfied, so t can try to unlock using a CAS on the lock state (line 21). If the CAS succeeds, there were no contending threads and the lock is successfully released. On the other hand, if the message queue is not empty, or the CAS fails, we can conclude that the message from the successor has either arrived or will eventually arrive. Therefore, t blocks, waiting for the message from its successor, before finally informing the successor that it can acquire the lock.
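The acquire and release paths described above can be exercised with a small executable model. The following Python sketch is written from the prose description only and is not the authors' implementation: queue.Queue mailboxes stand in for the hardware message queues, a mutex-protected cell (AtomicInt) emulates SWAP and CAS, and the names HybLock, AtomicInt, NONE, and demo are hypothetical.

```python
import queue
import threading

UNLOCKED = -1   # lock state value meaning "free"
NONE = -1       # "no successor known yet"
ALLOW = -2      # handover message; thread ids are assumed to be >= 0


class AtomicInt:
    """Emulates an integer with atomic SWAP and CAS (assumption: a mutex
    replaces the hardware atomic instructions)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def cas(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False


class HybLock:
    def __init__(self, n_threads):
        self.state = AtomicInt(UNLOCKED)
        self.mailbox = [queue.Queue() for _ in range(n_threads)]
        self.next = [NONE] * n_threads   # per-thread private 'next'

    def acquire(self, tid):
        prev = self.state.swap(tid)
        if prev != UNLOCKED:
            self.mailbox[prev].put(tid)          # line up behind prev
            tmp = self.mailbox[tid].get()
            if tmp != ALLOW:                     # successor wrote first
                self.next[tid] = tmp
                self.mailbox[tid].get()          # this one is ALLOW

    def release(self, tid):
        if self.next[tid] != NONE:               # successor already known
            succ, self.next[tid] = self.next[tid], NONE
            self.mailbox[succ].put(ALLOW)
            return
        if self.mailbox[tid].empty() and self.state.cas(tid, UNLOCKED):
            return                               # nobody was waiting
        succ = self.mailbox[tid].get()           # wait for successor's id
        self.mailbox[succ].put(ALLOW)


def demo(n_threads=4, iters=500):
    lock, shared = HybLock(n_threads), {"count": 0}

    def worker(tid):
        for _ in range(iters):
            lock.acquire(tid)
            value = shared["count"]              # non-atomic on purpose
            shared["count"] = value + 1
            lock.release(tid)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]
```

Note that when a successor is waiting, release touches only mailboxes and never the shared lock state, which is the property the text attributes to the message-based handover. The sketch says nothing about performance, since Python queues do not model message latency.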
Properties. HYBLOCK achieves several properties desirable in theory and practice, outlined as follows:
-Fast contention-free execution: HYBLOCK is very lightweight in the absence of contention, and one operation on shared data is sufficient to both take and free the lock.
-HYBLOCK is able to remove RMRs from the critical path of lock handover. Indeed, when multiple threads are waiting for the same lock, the handover is done without any operations on shared data; that is, the CAS at line 21 is never executed, and only message-passing operations are used.
When it comes to limitations of HYBLOCK, it is clear that nested locking is not supported out of the box, as messages from different lock instances can arbitrarily interleave. If desired, this limitation can be removed by associating special tags with messages, as well as by replacing the local next variable with a local stack, which would keep one next value for every level of nesting. This, however, is likely to decrease performance.
It should also be noted that lock handover performance, although improved, is still limited by the latency of the message from the current to the next lock owner. We will shortly see that with delegation, even this message latency can be removed from the critical path.
The Server Approach (MP-SERVER)
A client-server approach, such as RCL, is a natural fit for message passing. Indeed, RCL's client-server communication layer can be seen as an implementation of message passing over shared memory. Instead, we simply leverage hardware message passing support to implement client-server communication. We refer to this solution as MP-SERVER. Based on the model introduced in Section 2, Figure 4 explains why MP-SERVER may have better performance than its shared-memory counterpart. Compared to Figure 2, stalls can be avoided for two reasons. First, the server reads requests from the local message queue, without any remote actions that would cause it to stall. Second, the server does not wait for the actual message transmission to take place when it sends a response. When and how the messages are actually sent to their destinations is the responsibility of the underlying hardware message passing implementation. Therefore, if hardware message passing is used, we expect to be able to completely remove stalls related to synchronization from the critical execution path.
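The structure of such a server is simple enough to sketch in a few lines. The Python code below is a minimal illustration, assuming that queue.Queue objects play the role of the per-core hardware message queues; the names MPServer, apply_op, stop, and demo are hypothetical, and the sketch does not model message latency or stalls.

```python
import queue
import threading


class MPServer:
    """A dedicated thread that executes CSes on behalf of clients."""

    def __init__(self, n_clients):
        self.requests = queue.Queue()   # the server's own message queue
        self.replies = [queue.Queue() for _ in range(n_clients)]
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while True:
            msg = self.requests.get()   # read the next request locally
            if msg is None:             # shutdown marker used by stop()
                return
            client, func, arg = msg
            # Send the response and move on to the next request.
            self.replies[client].put(func(arg))

    def apply_op(self, client, func, arg):
        self.requests.put((client, func, arg))
        return self.replies[client].get()

    def stop(self):
        self.requests.put(None)
        self._thread.join()


def demo(n_clients=4, iters=200):
    server, counter = MPServer(n_clients), [0]
    seen = [[] for _ in range(n_clients)]

    def fetch_and_add(delta):   # the CS: only the server touches counter
        old = counter[0]
        counter[0] = old + delta
        return old

    def client(cid):
        for _ in range(iters):
            seen[cid].append(server.apply_op(cid, fetch_and_add, 1))

    threads = [threading.Thread(target=client, args=(c,))
               for c in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    server.stop()
    return counter[0], sorted(v for s in seen for v in s)
```

Because the counter is only ever touched by the server thread, every client observes a distinct value, and the data protected by the CS never has to move between threads.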
The Combiner Approach (HYBCOMB)
We now detail HYBCOMB, our combining algorithm tailored to take advantage of message passing. We start by describing the main principles of combining techniques over shared memory to identify how message passing can be used to improve performance.
Main principles. In combining algorithms, threads interact for two purposes: (i) electing a combiner and (ii) exchanging information between the combiner and threads that have operations to be executed in mutual exclusion. In shared-memory combining algorithms [Oyama et al. 1999; Hendler et al. 2010; Fatourou and Kallimanis 2012], these two tasks are handled by a single shared object: a list of requests. To execute an operation, a thread adds a request to the list. The current combiner traverses the list to fetch and execute requests. When the current combiner wants to return, it hands over the combiner role to the thread owning the next request in the list (if there are no requests to be executed, the next thread that inserts a request will become the combiner).
HYBCOMB uses hardware message passing for synchronization between the combiner and the other threads. As long as the combiner does not change, synchronization works as with MP-SERVER (Figure 4). Still, we use shared memory for managing combiner identity. In a nutshell, HYBCOMB works as follows. When a thread t wants to execute a request, it first checks the identity of the combiner through a shared variable. If a combiner is available and ready to handle the request, t sends a message to that combiner. If not, t tries to promote itself to a combiner by executing CAS on the variable that keeps the combiner identity.
Managing combiner identity using message passing would be complex and probably inefficient. The main problem is that a thread acting as a combiner has to stop combining at some point, which must be synchronized with actions of other threads. To get its operation executed by a combiner, a thread has to get the identity of the combiner thread and send a request to it. If the combiner identity changes in the meantime, the operation will never get executed. Dealing with this problem using message passing would require either a delegated thread (which is exactly what the combiner approach is trying to avoid) or intensive communication between threads (e.g., broadcast).
Detailed description. Algorithm 2 describes HYBCOMB. The interface is the same as that of CC-SYNCH: when a thread wants to execute a CS, it calls the apply_op method, providing the corresponding HYBCOMB instance, a pointer to the function to execute, and its arguments. Note, however, that HYBCOMB is not just a simple adaptation of existing combining algorithms, where message passing is used instead of a shared list to make the combiner thread aware of the requests to execute. As already mentioned, using message passing requires us to be able to identify the combiner thread to which requests should be sent. This should be carefully handled, especially at the time the combiner changes. This problem does not exist in combining techniques fully based on shared memory, as it is the combiner thread that fetches requests from a shared data structure.
The code executed by the active combiner spans lines 26 through 48. Algorithm 2 ensures that lines 26 through 47 are executed in mutual exclusion (i.e., that there is a single active combiner at a time). To manage combiner identity, a data structure called Node is used. Each thread owns a reference to a different node (my_node). The id of the thread owning a node is saved in the field Node.thread_id. Managing combiner identity is done using the shared pointer last_registered_combiner. To become a combiner (lines 20 through 24), a thread t tries to execute a CAS operation on last_registered_combiner to make it point to its node. If the CAS succeeds, t keeps a pointer to the node corresponding to the previous last registered combiner in its local variable last_reg. This mechanism can be seen as building a logical queue where the head of the queue is the current active combiner and the tail is the last registered combiner, each thread in the queue having a reference to the predecessor in its last_reg variable. The Node.combining_done flag is used to synchronize the threads in the queue. Before it starts executing as a combiner, a thread spins on the combining_done flag of its predecessor (line 22), which is set by the predecessor when it finishes combining (line 47).
Fig. 5. The current combiner is t6, and the next one is t2, polling on t6's node. The requests of t1, t3, t4, and t5 will be combined by t6 and t2.
Upon calling apply_op, a thread t first tries to register its request with last_registered_combiner. It does so by performing a fetch-and-increment on the Node.n_ops field of the corresponding node (line 14). This field guarantees that one combiner will receive and execute at most MAX_OPS requests of other threads. If the threshold MAX_OPS is not reached, t sends its request to the combiner using message passing (line 16) and waits for a response (line 17). If the last registered combiner cannot accept any new request, t tries to register itself as a combiner as already explained.
Once t becomes the active combiner, it first executes its own request (line 26). Then it reads messages from its message queue, processes requests, and sends responses. When its message queue is empty, t decides to stop combining and announces it by writing MAX_OPS to its n_ops field (line 33). Since it does so using SWAP, it retains the old value of n_ops (in total_ops), which is the total number of requests that it has to serve as a combiner. It then finishes its combining round by serving the remaining requests, if any (lines 37 through 40).
Before returning, t must get the node that it will use the next time it calls apply_op (we want to avoid allocating a new node for every apply_op call). Obviously, t cannot use the same node because that requires the combining_done field to be reset, but t cannot know when the next combiner will have read this field. As a solution, only one additional node is allocated for all n threads, and t gets the node that was used by the previous combiner (pointed to by departed_combiner) (lines 42 through 47): t knows that the combining_done field of this node can be reset since t was the thread spinning on this node. Finally, note that t must not reset the n_ops field of its new node at this point because other threads might still have an old reference to this node in their last_reg variable (lines 12 through 14): if n_ops were reset, these threads could send requests to t while it is not a combiner. Thus, t will reset n_ops only once it registers as a combiner again (line 21).
Figure 5 illustrates an execution of HYBCOMB, assuming MAX_OPS = 3, where threads t1 through t6 are simultaneously calling apply_op to execute their CSes. Thread t6 is the current combiner. It executes its own request and will subsequently execute the requests of t3, t4, and t1, which are waiting in its message queue. Since at most three requests can be executed by a thread on top of its own, the request from t2 could not be served by t6. Consequently, t2 has executed CAS (line 20 of Algorithm 2) and registered as a new combiner: it is now spinning on n_t6.combining_done, waiting for t6 to hand over the combiner role. Thread t5 will have its request executed by t2, once the latter starts combining. Note that the nodes of the current and future combiners (t6 and t2 in Figure 5) form an implicit queue: the head of the queue is the dummy node (pointed to by lock.departed_combiner), whereas the tail is the last registered combiner (pointed to by lock.last_registered_combiner).
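The interplay between registration, combining, and node recycling can be tried out with an executable model. The Python sketch below is reconstructed from the description above and is not Algorithm 2 itself: a mutex emulates FAA, SWAP, and CAS, queue.Queue mailboxes stand in for hardware message queues, a threading.Event replaces spinning on combining_done, and the decision to initialize every node with n_ops = MAX_OPS is an assumption made so that unregistered nodes reject requests. The helper names (_faa, _swap, _cas_last, _serve, demo) are hypothetical.

```python
import queue
import threading

MAX_OPS = 3  # per-combiner bound on delegated requests, as in Figure 5


class Node:
    def __init__(self, thread_id=None, done=False):
        self.thread_id = thread_id
        self.n_ops = MAX_OPS  # assumption: "full" until the owner registers
        self.combining_done = threading.Event()  # waited on, not spun on
        if done:
            self.combining_done.set()


class HybComb:
    def __init__(self, n_threads):
        self._atomic = threading.Lock()  # emulates hardware FAA/SWAP/CAS
        dummy = Node(done=True)
        self.departed_combiner = dummy
        self.last_registered_combiner = dummy
        self.my_node = [Node(t) for t in range(n_threads)]
        self.mailbox = [queue.Queue() for _ in range(n_threads)]

    def _faa(self, node):
        with self._atomic:
            node.n_ops += 1
            return node.n_ops - 1

    def _swap(self, node, value):
        with self._atomic:
            old, node.n_ops = node.n_ops, value
            return old

    def _cas_last(self, expected, new):
        with self._atomic:
            if self.last_registered_combiner is expected:
                self.last_registered_combiner = new
                return True
            return False

    def _serve(self, request):
        client, func, arg = request
        self.mailbox[client].put(func(arg))

    def apply_op(self, tid, func, arg):
        me, inbox = self.my_node[tid], self.mailbox[tid]
        while True:
            last_reg = self.last_registered_combiner
            if self._faa(last_reg) < MAX_OPS:
                # Request registered: delegate it by message and wait.
                self.mailbox[last_reg.thread_id].put((tid, func, arg))
                return inbox.get()
            if self._cas_last(last_reg, me):
                self._swap(me, 0)  # reset n_ops only after registering
                break
        last_reg.combining_done.wait()  # predecessor hands over the role
        result = func(arg)  # own request first
        served = 0
        while True:  # optional phase: serve what has already arrived
            try:
                self._serve(inbox.get_nowait())
            except queue.Empty:
                break
            served += 1
        total = min(self._swap(me, MAX_OPS), MAX_OPS)  # close registration
        while served < total:  # requests registered but not yet read
            self._serve(inbox.get())
            served += 1
        # Recycle the predecessor's node, then release the successor.
        recycled = self.departed_combiner
        recycled.thread_id = tid
        recycled.combining_done.clear()
        self.my_node[tid] = recycled
        self.departed_combiner = me
        me.combining_done.set()
        return result


def demo(n_threads=4, iters=200):
    comb, counter = HybComb(n_threads), [0]
    seen = [[] for _ in range(n_threads)]

    def fetch_and_add(delta):  # the CS: runs only on the active combiner
        old = counter[0]
        counter[0] = old + delta
        return old

    def worker(tid):
        for _ in range(iters):
            seen[tid].append(comb.apply_op(tid, fetch_and_add, 1))

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0], sorted(v for s in seen for v in s)
```

In this model, the recycled node keeps its stale n_ops value (at least MAX_OPS) until its new owner registers again, so threads holding an old reference to it fail their fetch-and-increment and fall back to the CAS path, as the text requires. The demo only checks mutual exclusion on a counter; it gives no indication of throughput.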
Additional comments. Before sketching the proof of correctness, we make a few remarks on the way HYBCOMB works. First, we can note that registering as a combiner (line 20) and resetting the n_ops counter (line 21) are not atomic. This does not affect the correctness of the algorithm. In the very unfortunate case where a thread t' executes the FAA at line 14 while t is between those two lines, t' will simply not manage to register its request with t, and so it will try to become the next combiner. This could merely result in a performance penalty, as t would only have its own request to execute as a combiner. Results presented in Section 5 show that this rarely occurs in practice.
Note also that the first while loop in the request execution part (lines 28 through 31) is not necessary for correctness: the thread can decide to stop combining as soon as it has executed its own request. Still, this loop is beneficial for performance, as postponing the SWAP at line 33 increases the combining potential.
HYBCOMB uses a CAS operation like some other combining algorithms [Oyama et al. 1999; Hendler et al. 2010], but unlike CC-SYNCH [Fatourou and Kallimanis 2012]. It is well known that CAS can impair performance (because it can repeatedly fail, causing contention) as well as fairness (a thread can starve if it executes CAS in a loop and persistently fails). We still choose to use CAS and not SWAP at line 20 for the following two reasons: (i) if SWAP is used and several threads try to register as combiners, they all succeed but some of them only have their own request to execute as a combiner, whereas with CAS only one thread manages to register as a combiner and potentially execute all other requests; (ii) the CAS is not expected to be a hot spot in HYBCOMB, as it is only executed when a thread wants to register as a combiner. Experiments presented in Section 5 confirm the second point. If desired, a middle ground would be to use SWAP only if CAS fails several times.
Correctness proof (Sketch).
The key idea is to show that Algorithm 2 maintains a queue of Nodes, denoted by CS_queue (queue for entering the CS corresponding to lines 26 through 48), where each node represents a thread (except for the head). As discussed earlier and shown in Figure 5, the head of the queue is the node pointed to by lock.departed_combiner.
Other nodes in the queue, if any, correspond to current and future combiners, that is, threads that want to enter the CS. The operation insert into CS_queue corresponds to a successful execution of CAS at line 20. The operation remove from CS_queue corresponds to the execution of lines 42 through 47. The queue CS_queue is represented as follows. The tail of CS_queue is the node pointed to by the field lock.last_registered_combiner (line 8). The predecessor of node n_t (representing thread t) is the node pointed to by last_reg_t (line 12). The head of CS_queue is node n_t such that last_reg_t.combining_done = true.
In addition to nodes representing threads, CS_queue includes one dummy node, initially new node (line 6). The dummy node is the only node in CS_queue with the field combining_done = true. An empty queue contains only the dummy node: in this case, lock.last_registered_combiner points to the dummy node.
Note that the field lock.departed_combiner plays no role with respect to CS_queue. It points to the dummy node: whenever some thread t leaves the CS, the node n_t becomes the new dummy node and the previous dummy node becomes n_t. The reason for this has been explained at the end of the Detailed description section.
LEMMA 4.1. Algorithm 2 maintains the queue structure just described.
PROOF (SKETCH). The queue structure is modified by a successful execution of CAS (line 20). We prove by induction that the queue structure always holds.
-Base step: The empty queue structure holds initially by lines 6 through 8.
-Induction step: By lines 12 and 20 (CAS), if the queue structure holds before executing CAS, then it is easy to see that the queue structure also holds after successfully executing CAS.
We also show that pointer lock.departed_combiner always points to the dummy node. Initially by line 7, lock.departed_combiner correctly points to the dummy node. Moreover, lock.departed_combiner is updated at line 44, pointing to node n_t representing thread t that just left the CS. By line 47 (setting combining_done to true), the node becomes the new dummy node.
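The queue structure and the node recycling described above can be explored with a toy sequential model. The sketch below is reconstructed from the prose only: the names Node, Lock, register, leave, and dummy_count are hypothetical, the order of the two field updates inside leave is an assumption, and no concurrency is modeled.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.combining_done = False


class Lock:
    def __init__(self):
        # Initially the queue holds only the dummy node.
        dummy = Node("dummy")
        dummy.combining_done = True
        self.departed_combiner = dummy
        self.last_registered_combiner = dummy


def register(lock, node):
    """Model of a successful CAS: append `node`, return its predecessor."""
    last_reg = lock.last_registered_combiner
    lock.last_registered_combiner = node
    return last_reg


def leave(lock, node):
    """Model of the hand-over: `node` becomes the dummy node and the
    previous dummy node is recycled as the leaving thread's next node."""
    recycled = lock.departed_combiner
    lock.departed_combiner = node
    recycled.combining_done = False  # safe: only this thread spun on it
    node.combining_done = True       # releases the next combiner, if any
    return recycled


def dummy_count(nodes):
    """Number of nodes currently marked as the dummy node."""
    return sum(1 for n in nodes if n.combining_done)
```

Replaying the scenario of Figure 5 in this model (t6 registers, then t2, then t6 leaves) keeps exactly one node with combining_done set, and that node is always the one lock.departed_combiner points to.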
PROPOSITION 4.2. Lines 26 through 47 are executed in mutual exclusion (one combiner at a time).

PROOF (SKETCH). A thread t can execute lines 26 through 47 only after a successful execution of CAS (line 20). By Lemma 4.1, t is correctly introduced in CS_queue. By the same lemma, only node n_t at the head of CS_queue is such that last_reg_t.combining_done = true. Thus, by line 22, only one thread can enter the CS.
It follows from Proposition 4.2 that Algorithm 2 is safe: because the CS is executed in mutual exclusion, thread operations (pointed to by the func_ptr argument) are also executed in mutual exclusion. It can be shown that linearizability follows (since an operation of thread t can only be executed between the moments of t entering and leaving apply_op). We now show that liveness also holds, that is, that Algorithm 2 is deadlock free (if there are threads calling apply_op, some thread eventually executes its operation and returns from apply_op).
LEMMA 4.3. For all nodes n_t with n_t.n_ops < MAX_OPS, we have n_t in CS_queue.
PROOF. We show that the contrapositive holds. Assume that node n_t is not in CS_queue. When n_t is in its initial state, we trivially have n_t.n_ops ≥ MAX_OPS. Consider now t entering the CS and later leaving the CS. To reset n_t.n_ops, a thread executes line 21. This can happen only if the CAS at line 20 succeeds, which means that the thread enters CS_queue. Before leaving the CS, by line 33, we trivially have n_t.n_ops ≥ MAX_OPS, which terminates the proof.
LEMMA 4.4. At lines 29 and 38, only requests (operations to execute) can be received (not responses).
PROOF. For a contradiction, assume r to be the first response received at line 29 or 38 by some thread t. Response r must have been sent by some thread t' at line 16, 30, or 39. Response r cannot have been sent at line 16, because only requests are sent at line 16.
So assume that r is sent by some thread t' at line 30 or 39. However, thread t' could only have received a request at line 29 or 38 (since t is the first to have received a response at those lines). Therefore, t' has sent a response at line 30 or 39: a contradiction.
LEMMA 4.5. If the message queue of thread t contains a request, then n_t is in CS_queue.
PROOF. Assume that thread t' sends a request to thread t. By lines 12 and 16, the message is sent to t = last_reg_t'.thread_id. We have to show that n_t is in CS_queue when t receives the request from t'. Clearly, n_t is at the tail of CS_queue when t' executes line 12. If t' sends its request to t at line 16, then line 14 was successfully executed by t', that is, n_t.n_ops < MAX_OPS before t' executes line 14. By Lemma 4.3, n_t is in CS_queue at this time. By Lemma 4.4, and since every thread must execute line 14 before sending a request to t, thread t cannot leave the CS before the request from t' is received. Therefore, n_t is still in CS_queue when the request from t' arrives in the message queue of t.
LEMMA 4.6. At line 17, thread t cannot receive a request (i.e., t can only receive the response to the request sent at line 16).
PROOF. Assume that t receives a request at line 17 while n_t is not in CS_queue. Therefore, the message queue of t contains a request. By Lemma 4.5, node n_t is in CS_queue: a contradiction.
LEMMA 4.7. A thread t that enters the CS eventually responds to all requests sent to t and leaves the CS.
PROOF. Thread t can block inside the CS in three ways: (i) an infinite loop at lines 28 through 31; (ii) an infinite loop at lines 37 through 40; and (iii) the receive at line 38 never returns. By Lemma 4.5, the message queue of t is empty before t executes line 20. To send a message to t, thread t' must execute line 14 while n_t.n_ops < MAX_OPS, that is, while t is between lines 22 and 33. Since n_t.n_ops is incremented before each message is sent, at most MAX_OPS messages can be sent to t, so (i) and (ii) cannot hold. Since after line 33 total_ops contains the exact number of requests that will eventually be sent to t, receive is called exactly once for each request, which precludes (iii). Since every receive is followed by a send with a response, t responds to all requests before leaving the CS.
LEMMA 4.8. If t sends request r at line 16, it eventually receives a response at line 17.
PROOF. By Lemma 4.5, the request from t is received by some thread t' that is in CS_queue. It follows from Lemma 4.7 and the construction of CS_queue (Lemma 4.1) that t' is, or will eventually become, the combiner. Additionally, by Lemma 4.7, t' eventually responds to the request of t.
Finally, we have the following proposition.
PROPOSITION 4.9 (LIVENESS). Algorithm 2 ensures deadlock freedom.
PROOF. If thread t wants to execute some operation op, then either t eventually gets the response (Lemma 4.8) or t tries to enter CS_queue (line 20). In the latter case, if t succeeds (executes the CAS successfully), then t eventually executes op (line 26) and leaves CS_queue (Lemma 4.7). If the CAS executed by t fails, lock.last_registered_combiner has changed in the meantime, which means that some other thread has successfully executed the CAS and will thus eventually execute its operation.
Recall that starvation freedom is not guaranteed in Algorithm 2 because of the CAS, but it can be easily ensured if a SWAP is introduced (as discussed in the preceding Additional comments section).
EVALUATION
In this section, we implement and thoroughly evaluate the algorithms presented in Sections 3 and 4. We begin by introducing the hybrid processor used and our experimental setup. Next, we present experiments that evaluate different implementations of a concurrent counter. The analysis is then extended to more complex concurrent objects, namely queues and stacks. Finally, we discuss the generality of our results and their applicability to other platforms.
Platform
We use the Tilera TILE-Gx8036, which integrates 36 cores, runs at 1.2 GHz, and features complete hardware support for both coherent shared memory and message passing [Tilera 2014]. The software part comprises GCC 4.4.6 and version 2.6.40.38-MDE-4.1.0.148119 of Tilera's custom Linux kernel. The memory consistency model is relaxed compared to x86, so a careful use of memory fences is necessary to avoid inconsistency. Each core has a dedicated hardware message buffer, capable of storing up to 118 64-bit words. The message buffer of each core is four-way multiplexed, which means that every per-core buffer can host up to four independent hardware FIFO queues, containing incoming messages. The user dynamic network (UDN) allows applications to exchange messages directly through the mesh interconnect without OS intervention. While exchanging messages, a thread must be pinned to a core and registered to use the UDN (but it can unregister and freely migrate afterward). When a message is sent from core A to core B, it is stored in the specified hardware queue of core B. The send operation is asynchronous and does not block, except in the following case: since messages are never dropped, if a hardware queue is full, subsequent incoming messages back up into the network and may cause the sender to block. It is the programmer's responsibility to avoid deadlocks that can occur in such situations. When a thread executes receive on one of the four local queues, the first message from the queue is returned. If there are no messages, the thread blocks. Messages consist of one or multiple words.
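The buffer capacity translates into a simple sizing check that a programmer might perform to rule out blocked senders. The following is a back-of-the-envelope sketch, not part of the evaluated code: the function names are hypothetical, the message sizes are illustrative, and it assumes that a single hardware queue may use the whole 118-word buffer.

```python
BUFFER_WORDS = 118  # per-core buffer capacity in 64-bit words (from the text)


def max_buffered_messages(words_per_message, buffer_words=BUFFER_WORDS):
    """Whole messages that fit before further sends back up into the network."""
    return buffer_words // words_per_message


def cannot_overflow(n_senders, words_per_message, outstanding_per_sender=1,
                    buffer_words=BUFFER_WORDS):
    """True if the receiver's buffer can hold every message that may be in
    flight at once, so that no sender can ever block on a full queue."""
    needed = n_senders * outstanding_per_sender * words_per_message
    return needed <= buffer_words
```

For instance, under these assumptions, 35 clients with one outstanding 3-word request each need 105 words and fit in the buffer, whereas 4-word requests would need 140 words and could back up into the network.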
Methodology and Setup
We have implemented HYBLOCK, MP-SERVER, and HYBCOMB on the TILE-Gx, as well as several algorithms purely based on shared memory: the MCS [Mellor-Crummey and Scott 1991] and CLH [Craig 1993; Magnusson et al. 1994] queue locks; a test-and-set (TAS) lock; the CC-SYNCH combining algorithm [Fatourou and Kallimanis 2012]; and SHM-SERVER, a server-based approach. SHM-SERVER can be seen as a simplified version of RCL [Lozi et al. 2012], as it implements the same core mechanism (an array of cache lines, one for each client) but lacks support for some advanced features, such as nested CSes (note that this simplification does not decrease performance). The implementations have been carefully optimized and compiled with the -O3 flag. Because of the relaxed memory model of the TILE-Gx, we have inserted memory fences where necessary to ensure correctness. In particular, with classic locks, a fence is always necessary before releasing the lock so that the next lock owner can read the most recent data. In the combiner implementations, fences are not necessary after every CS execution, but only on combiner switching, as long as shared data are protected by a single combiner. Some specific algorithms, such as the two-lock Michael and Scott queue (MS-Queue) [Michael and Scott 1996] (see Section 5.4), where one combiner is dedicated to enqueue operations and one combiner is dedicated to dequeue operations, require a fence after every operation. Similarly, server approaches never need a fence after operation execution if protected data are only accessed by the server, which is very often the case in practice for contended CSes [Lozi et al. 2012].
To obtain the best possible performance, we augment all of the delegation implementations with a simple interface that allows a thread to send a unique opcode of the CS to the servicing thread rather than a function pointer. This allows the compiler to inline the function calls that the servicing thread makes for every CS, which results in a visible performance increase in most cases [Cleary et al. 2013] . It is worth noting that the results are qualitatively the same without this optimization.
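The opcode-based interface can be pictured with a small server-style model. This is a hedged sketch rather than the implementation evaluated here: Python's queue.Queue merely stands in for the hardware message queues, the opcodes and function names are hypothetical, and Python cannot show the inlining benefit, only the shape of the protocol.

```python
import queue
import threading

# Hypothetical opcodes agreed upon by clients and the servicing thread.
OP_INCREMENT, OP_READ, OP_STOP = 0, 1, 2


def server(requests, replies):
    counter = 0  # protected data, touched only by the servicing thread
    while True:
        opcode, client_id = requests.get()  # blocking receive
        if opcode == OP_STOP:
            return
        if opcode == OP_INCREMENT:
            counter += 1
        # OP_READ changes nothing; every request gets the current value back.
        replies[client_id].put(counter)


def client(client_id, n_ops, requests, replies):
    for _ in range(n_ops):
        requests.put((OP_INCREMENT, client_id))  # send the opcode, not a function
        replies[client_id].get()                 # wait for the response


def run(n_clients=4, n_ops=100):
    requests = queue.Queue()
    replies = [queue.Queue() for _ in range(n_clients)]
    srv = threading.Thread(target=server, args=(requests, replies))
    srv.start()
    workers = [threading.Thread(target=client, args=(i, n_ops, requests, replies))
               for i in range(n_clients)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    requests.put((OP_READ, 0))
    final = replies[0].get()
    requests.put((OP_STOP, 0))
    srv.join()
    return final
```

Because the servicing thread branches on a small integer, a compiled implementation can place the body of each CS directly in the dispatch code instead of making an indirect call.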
We use the methodology commonly found in related studies [Fatourou and Kallimanis 2012; Hendler et al. 2010; Michael and Scott 1996; Morrison and Afek 2013]. In each experiment, a specified number of application threads repeatedly execute operations on a concurrent object. After every operation, a thread executes a random number of empty loop iterations (at most 50). This simulates local work and prevents long runs, in which a thread would execute bursts of operations on a concurrent object in its local cache. To minimize interference caused by context switching, we assume a uniprogrammed environment, where each thread runs on a separate core (multiprogramming is discussed in Section 6). We pin threads to cores in ascending order (i.e., thread i is pinned to core i). With server-based approaches (SHM-SERVER and MP-SERVER), the server code is executed by thread 0, and other threads execute application code (the server position has a negligible performance impact). In case of classic locks and combining, all threads run the same code. Unless otherwise stated, the maximum number of requests a thread can combine in HYBCOMB and CC-SYNCH is set to 200 (we analyze this choice later in this section). Every value reported in the graphs is an average over ten 1-second runs.
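The measurement loop described above can be sketched as follows. This is an illustrative harness only: the function names are hypothetical, Python threads are not pinned to cores and are serialized by the interpreter, so the numbers it produces say nothing about the TILE-Gx; it merely shows the structure of one run and of the averaging.

```python
import random
import threading
import time

MAX_LOCAL_ITERS = 50  # upper bound on empty loop iterations (from the text)


def run_once(n_threads, op, duration_s):
    """Run `op` repeatedly on `n_threads` threads; return operations per second."""
    counts = [0] * n_threads  # one slot per thread, so no sharing between workers
    stop = threading.Event()

    def worker(idx):
        rng = random.Random(idx)
        while not stop.is_set():
            op()
            counts[idx] += 1
            for _ in range(rng.randint(0, MAX_LOCAL_ITERS)):
                pass  # empty iterations standing in for local work

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    time.sleep(duration_s)
    stop.set()
    for t in threads:
        t.join()
    return sum(counts) / duration_s


def average_throughput(n_threads, op, runs=10, duration_s=1.0):
    """Average throughput over several fixed-length runs."""
    return sum(run_once(n_threads, op, duration_s) for _ in range(runs)) / runs
```

A call such as average_throughput(36, counter_increment) would mirror the setup described above, where counter_increment is whatever operation on the concurrent object is being measured.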
Microbenchmarks
For the sake of clarity, we evaluate classic locks and delegation separately. After discussing the key results related to each of the techniques, we compare them directly to complete the analysis.
5.3.1. Classic Locks. We first use each of the locks to implement a simple concurrent object: a counter. Figure 6(a) gives the counter throughput. HYBLOCK is the best performer, reaching 1.35x higher throughput than the MCS and CLH locks in high concurrency levels. Even in lower degrees of concurrency, HYBLOCK still performs visibly better than the other queue locks. We attribute this improvement to the optimized way of handing over the lock: HYBLOCK uses a message, whereas MCS and CLH use spinning. As expected, TAS scales poorly.
Figure 6(b) shows average operation latencies observed by the threads in the same benchmark. Even with one thread (no concurrency at all), HYBLOCK is more efficient than most alternatives, except for TAS. Although message passing is not used in this case, HYBLOCK has the advantage that it does not need any node housekeeping, which is necessary with MCS and CLH. We can also see that MCS and CLH deliver nearly indistinguishable performance, except with 2 to 4 threads, where CLH is more robust. We believe that this is architecture and implementation dependent. In any case, the general trend we can observe with all of the classic locks is that the average latency increases rapidly as concurrency grows, even with HYBLOCK as the most efficient option.
We now examine what happens when the CS body is longer. We implement a CS in which the elements of an integer array are incremented. We vary the number of increments in the CS and observe the maximum throughput (with 36 threads) in Figure 6(c). When the CS contains no shared-memory accesses (zero increments), we are left with the pure synchronization overhead of the lock/unlock pair. This overhead is about 2x lower with HYBLOCK than with the shared-memory queue locks. As soon as we add shared data manipulation in the CS, even only one increment, there is a sharp performance decrease, mostly because the shared data bounces between cores, as described in Section 3. As the number of increments increases past one, the performance decreases more slowly. This is because one cache miss brings in eight array elements, so only one in eight increments is particularly expensive because of a cache miss. In addition, it should be noted that prefetching can hide some of the latency of the subsequent cache misses. Hence, with 25 increments, the advantage of HYBLOCK over MCS and CLH is still about 1.18x.
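The amortization argument above amounts to simple arithmetic, sketched below. The sketch assumes that the array starts on a cache-line boundary, that the increments touch consecutive elements, and that a fetched line is not evicted during the CS; the function name is hypothetical, and the figure of eight elements per line is the one quoted in the text.

```python
ELEMENTS_PER_LINE = 8  # array elements brought in by one cache miss (from the text)


def expensive_increments(n_increments, per_line=ELEMENTS_PER_LINE):
    """Number of increments in one CS that pay for a cache miss."""
    # Ceiling division: every started cache line costs exactly one miss.
    return (n_increments + per_line - 1) // per_line
```

Under these assumptions, a CS with 25 increments incurs only 4 misses, which is why throughput degrades slowly once the first increment has been added.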
In conclusion, HYBLOCK is visibly more efficient than MCS and CLH, both in low and high concurrency levels. What limits further performance improvement, however, are overheads inherent to classic locks. These are first and foremost cache misses inside the CS, but the memory fences associated with every CS play an important role on this platform as well. Next, we will see that more significant performance gains are possible with delegation, as the mentioned inherent overheads do not exist.
5.3.2. Delegation. Again, we start by implementing a concurrent counter. Figure 7(a) shows the throughput of the counter implementations. The approaches that use hardware message passing are clearly faster. MP-SERVER is most efficient in all concurrency levels. It reaches 4.3x higher throughput than SHM-SERVER, indicating that message passing supported natively is much more efficient than emulation over shared memory. When it comes to combining, HYBCOMB consistently outperforms CC-SYNCH. This is especially pronounced in higher concurrency levels, where HYBCOMB reaches about 2.5x higher throughput. CC-SYNCH and SHM-SERVER have very similar performance, indicating that CC-SYNCH manages to avoid dedicating cores at virtually no performance cost. On the other hand, the difference between MP-SERVER and HYBCOMB is much more visible. We will shortly identify the source of this difference and explain how it can be minimized. We can also see a big improvement compared to the throughput of classic locks (Figure 6(a)), which confirms that delegation is more resilient to contention.
Figure 7(b) shows the average request latency observed by application threads. Again, MP-SERVER has by far the lowest latency, even in low concurrency levels, indicating that hardware message passing is useful even latency wise. HYBCOMB has lower latency than CC-SYNCH, which becomes especially visible as concurrency increases. The only noteworthy exception is single-threaded performance, where CC-SYNCH is better than HYBCOMB. We believe that this is mainly because an isolated thread running CC-SYNCH executes only one atomic instruction per operation, whereas HYBCOMB executes three. Since atomic instructions on the TILE-Gx are not executed in the local L1 cache, but in the L2 cache of the core that is home to the corresponding memory word (most likely a distant core), this results in a higher latency. As concurrency increases, the latency of both CC-SYNCH and HYBCOMB dips at one point before continuing to grow (between 14 and 17 application threads for CC-SYNCH, and between 14 and 24 for HYBCOMB). This is due to more intensive combining, as we will confirm shortly.
As with classic locks, we continue by replacing one counter by an array of 64 counters, incremented in a loop with a varying number of iterations. Figure 7(c) presents the results. MP-SERVER and HYBCOMB lead to better performance mainly when CSes are short: synchronization is very cheap with delegation, so the time to execute a slightly longer CS body already dominates synchronization overheads. At 15 loop iterations, the difference between the best (MP-SERVER) and the worst (SHM-SERVER) performer drops to about 10%, as the time to execute the CS body (which is the same with all of the implementations, if we ignore combiner switching) dominates the entry/exit overhead.
One might question the choice of the maximum allowed combining rate (MAX_OPS). If MAX_OPS is too low, less combining is possible, which negatively affects throughput. On the other hand, increasing it above a certain limit does not increase throughput further, as the cost of combiner switching becomes negligible, but can result in higher latency observed by the combining thread. The optimal value heavily depends on the application needs and anticipated concurrency level. In Figure 8(a), we examine how the maximum achievable counter throughput changes with MAX_OPS. Very high MAX_OPS values provide little benefit in terms of throughput of CC-SYNCH. On the other hand, as we increase MAX_OPS up to 1,000, the throughput of HYBCOMB continues to grow, barely showing signs of saturation. Combining is so fast with HYBCOMB that the impact of combiner switching is visible even when MAX_OPS is high. This explains the difference between MP-SERVER and HYBCOMB observed in Figure 7(a) (recall that MAX_OPS is set to 200 there). The throughput of HYBCOMB levels off at about 88 Mops/sec, with MAX_OPS set to 5,000. Therefore, one can achieve nearly as high throughput with HYBCOMB as with MP-SERVER, if willing to trade the throughput increase for sporadic latency "hiccups" for some requests (when the requesting thread becomes a combiner). We have chosen a moderate value of 200 for our experiments, since it already provides the highest possible throughput with CC-SYNCH and decent results with HYBCOMB.
Now we more precisely identify the reason for the observed performance improvement with HYBCOMB and MP-SERVER in the counter benchmark. Figure 8(b) shows the average number of CPU stalls per operation on the servicing thread under maximum load, as well as the total number of cycles per operation. The advantage of HYBCOMB and MP-SERVER becomes clearer: the servicing thread is virtually never stalled, whereas CPU stalls account for more than 50% of the cycles of the servicing thread in CC-SYNCH and SHM-SERVER. There are no event counters that would provide more fine-grained information on the source of stalls, but we believe that they mostly originate from the load-store unit, which has to wait for the cache coherence protocol to fetch data. This confirms the reasoning from Section 3: cache coherence-related stalls are an important source of overhead, and hardware message passing is helpful in avoiding them.
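Returning to the MAX_OPS trade-off, one way to see why a fast combiner benefits from larger MAX_OPS values is a simple amortization model. The model below is an assumption made for illustration, not something measured in this study: c_op and c_switch are hypothetical per-operation and per-hand-over costs in cycles, and every combiner is assumed to serve its own request plus exactly MAX_OPS others.

```python
def per_op_cycles(c_op, c_switch, max_ops):
    """Amortized cycles per operation when each combiner serves its own
    request plus `max_ops` others before handing over (hypothetical costs)."""
    batch = max_ops + 1
    return c_op + c_switch / batch


def switch_share(c_op, c_switch, max_ops):
    """Fraction of the amortized per-operation cost due to combiner switching."""
    amortized_switch = c_switch / (max_ops + 1)
    return amortized_switch / per_op_cycles(c_op, c_switch, max_ops)
```

In this model, the cheaper the per-operation cost, the larger the share of time spent on switching at a given MAX_OPS, which is consistent with HYBCOMB continuing to gain from higher MAX_OPS values after CC-SYNCH has saturated.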
Figure 8(c) shows the average combining rate with HYBCOMB and CC-SYNCH. Ideally, we expect it to reach MAX_OPS under high load. At the beginning, the actual combining rate steadily grows and is approximately equal to the number of threads minus one. This is because a combiner manages to combine one request for all of the other threads. At that point, no thread has started the subsequent operation yet, so the combiner returns. As concurrency grows, more requests arrive at the combiner concurrently. As it takes more time to service them, there is more time for other requests to arrive before the combiner returns and so forth. This circular effect leads to a sudden sharp increase in the combining rate, which explains the latency dip that we observed in Figure 7(b). As we can see in Figure 8(c), in high concurrency levels CC-SYNCH reaches the desired combining rate, whereas HYBCOMB is slightly below it. This is because registering as a combiner and resetting the n_ops field are not atomic. As explained in Section 4.3, an unfortunate thread interleaving could leave one combiner with no work to do because a new thread would register as a combiner before any request is associated with the current one. However, we can see that this has only a marginal effect on the combining rate in practice: despite somewhat lower combining rate, HYBCOMB still has much better performance than CC-SYNCH (Figure 7).
Finally, recall that HYBCOMB uses CAS, but the presented graphs indicate that this does not cause visible performance degradation. This is because when concurrency is high, threads rarely execute CAS: they mostly send their requests to an active combiner. Indeed, we have measured as few as 0.1 executed CAS per operation (call to apply_op) in high concurrency levels. This number is a bit higher when concurrency is not high enough to trigger high combining rates, but even then there are not more than 0.7 CAS per operation in multithreaded executions. Regarding fairness, we have measured the ratio between the highest and lowest number of operations executed by some thread (so 1 denotes ideal fairness). Across the whole concurrency spectrum, the highest value of this ratio with HYBCOMB is 1.2 and the average is 1.16. Even MP-SERVER, in which all requests are read from a hardware FIFO queue, has a ratio of nearly 1.1, only because some cores are nearer to the server, so they execute slightly more operations. Hence, the use of CAS in HYBCOMB does not impair fairness on this platform.
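The two metrics used in this paragraph are straightforward to compute from per-thread counters. The helper functions below are a minimal sketch with hypothetical names; the sample values used to exercise them are made up and are not measurements from this study.

```python
def fairness_ratio(ops_per_thread):
    """Ratio between the highest and lowest per-thread operation counts;
    1.0 denotes ideal fairness."""
    if not ops_per_thread or min(ops_per_thread) <= 0:
        raise ValueError("every thread must have executed at least one operation")
    return max(ops_per_thread) / min(ops_per_thread)


def cas_per_operation(cas_executed, operations):
    """Average number of CAS instructions executed per completed operation."""
    if operations <= 0:
        raise ValueError("operations must be positive")
    return cas_executed / operations
```

For example, a run in which the busiest thread completes 120 operations and the least busy one completes 100 has a fairness ratio of 1.2.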
5.3.3. Locking Versus Delegation. In Section 3, we pointed out key differences between classic locks and delegation. Recall that from a pure performance perspective, delegation is more resilient to contention, whereas classic locks are expected to achieve better low-concurrency performance, which we now verify. Figure 9 gives the latency of a concurrent counter implemented using the different locks and delegation approaches: it is a subset of latency data from Figures 6(b) and 7(b), represented as a bar chart. The results confirm that delegation is more resilient to contention: with five threads, all delegation implementations outperform the locks. But even in lower concurrency, the locks turn out not to be superior in all cases. In particular, MP-SERVER is by far the best performer, even with a single application thread (i.e., no contention at all): it even outperforms a simple TAS lock in this case. This might look surprising, as MP-SERVER includes communication with the server, which is avoided in classic locks and combining. Recall, however, that even a TAS lock requires an atomic operation (which is executed in a remote L2 cache on the TILE-Gx) and memory fences (to make sure that the next lock owner's view of protected data stays consistent). This turns out to be more costly than contacting the server using hardware message passing.
According to the presented data, hardware support for message passing might justify server-based CS execution even for uncontended locks. This, however, depends on the particular scenario, because of the already discussed downsides of this approach (server dedication, false serialization, the need to appropriately encapsulate CSes, etc.). Note that a detailed comparison of classic locks and delegation is out of the scope of this study: the purpose of this section is merely to present global trends and to point out that hardware message passing support changes the landscape of synchronization performance.
Queues and Stacks
Because of their ubiquity, concurrent linearizable queues and stacks are typically used to evaluate the performance of universal synchronization constructions [Hendler et al. 2010; Fatourou and Kallimanis 2011, 2012]. Following this observation, we implement some well-established queues and stacks from the literature and analyze their performance. With these experiments, we study an important use case where CSes are usually short. The implementations store 64-bit values and are evaluated under balanced load. For brevity, we focus only on throughput analysis. The latencies show trends similar to those presented in Section 5.3. As delegation is able to achieve much higher throughput than classic locks, we leave the latter out from the plots to avoid clutter, except for HYBLOCK, as the best performer in that category.
Queues. One of the best-known blocking queues is the fine-grained MS-Queue [Michael and Scott 1996]. It is based on a linked list accessed using two CSes, so enqueues and dequeues can take place in parallel. Its performance mostly depends on the way CSes are implemented. We implement MS-Queue using HYBCOMB, CC-SYNCH, the two server-based approaches (which requires two dedicated servers per queue instance), and HYBLOCK. Besides the two-lock version, we implement the same queue using a single lock. We also test LCRQ [Morrison and Afek 2013], a nonblocking queue that takes advantage of the wide spectrum of atomic operations supported by x86 processors. The TILE-Gx supports most of the necessary instructions, so adapting the LCRQ code written in C for x86 was relatively easy.
The queue performance is shown in Figure 10(a). The single-lock MS-Queues ("-1" suffix in the legend) perform best. Among them, MP-SERVER and HYBCOMB are most efficient: they respectively obtain up to 2x and 1.5x higher throughput than the third best implementation. LCRQ, as well as the two-lock versions of MS-Queue, level off sooner than the rest, which we now explain in more detail. One might expect fine-grained locking to always outperform a coarse lock. However, fine-grained locking involves a trade-off, since the additional synchronization that it includes might outweigh the gain that comes from increasing parallelism [Hendler et al. 2010]. Given Tilera's relaxed memory model, the enqueue and dequeue methods of the two-lock queue must be carefully coded if they can run in parallel: memory fences are necessary to ensure queue consistency. When delegation is used, it turns out that the necessity of inserting fences far outweighs the benefit from fine-grained access. Therefore, a simple sequential queue implemented using MP-SERVER or HYBCOMB yields the best results.
On the other hand, fine-grained synchronization pays off when classic locks are used, as HYBLOCK-2 yields an almost twofold performance increase over HYBLOCK-1 (not shown). This is because synchronization is much more expensive in this case, and halving its cost by moving to two locks is not canceled out by the additional fences.
Despite its excellent performance on x86 [Morrison and Afek 2013], LCRQ is less efficient on the TILE-Gx and achieves performance similar to that of the HYBLOCK queue. We speculate that this is primarily because of the way atomic instructions work on this processor. Namely, L2 caches are in charge of executing them. This means that two atomic instructions might collide on an L2 cache even if they have independent data sets, leading to frequent false serialization. Better performance might be achievable by optimizing LCRQ with the cache hierarchy of the TILE-Gx in mind, but this falls outside the scope of this study.
Stacks. The stack is known to be hard to parallelize, as both push and pop operations access its top. One way to obviate its seemingly inherent sequential nature is to use the elimination technique [Shavit and Touitou 1995; Calciu et al. 2013]: if a push and pop operation are executed concurrently, they can be eliminated to avoid accessing the stack. Still, if an operation cannot be eliminated, it has to access the top of the stack. As elimination is orthogonal to the content of this article, we evaluate the performance of a nonelimination concurrent stack (which, of course, can be used to back up an elimination-based stack).
We evaluate six implementations: a sequential linked-list based stack, turned concurrent using MP-SERVER, HYBCOMB, CC-SYNCH, SHM-SERVER, and HYBLOCK, as well as Treiber's well-known nonblocking stack [Treiber 1986]. Their performance is given in Figure 10(b). MP-SERVER and HYBCOMB stacks are again the best performers, and the numbers nearly match those given in Figure 10(a) for the single-lock MS-Queue. This is not surprising, as both concurrent objects are represented as linked lists protected by a coarse lock. Treiber stack performance is inferior to that of the other implementations, because the head of the stack is accessed using CAS. This causes growing contention as concurrency increases, as most CAS operations repeatedly fail.
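The CAS retry loop at the heart of a Treiber-style stack can be sketched as follows. This is a textbook-style illustration, not the implementation evaluated here: Python has no hardware CAS, so the AtomicRef class (a hypothetical name) emulates one with an internal lock, and failed_cas is an approximate, unsynchronized diagnostic counter.

```python
import threading


class AtomicRef:
    """Emulated atomic reference; the internal lock stands in for hardware CAS."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class TreiberStack:
    def __init__(self):
        self.top = AtomicRef(None)  # each cell is a (value, next_cell) tuple
        self.failed_cas = 0         # approximate count of retries

    def push(self, value):
        while True:
            old = self.top.load()
            if self.top.cas(old, (value, old)):
                return
            self.failed_cas += 1    # another thread changed the top: retry

    def pop(self):
        while True:
            old = self.top.load()
            if old is None:
                return None         # empty stack
            value, below = old
            if self.top.cas(old, below):
                return value
            self.failed_cas += 1
```

Every operation targets the same top reference, so under contention a thread may go through the loop several times before its CAS succeeds, which is the effect described above.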
Discussion
One might wonder to what extent our results are processor specific. To answer this question, we have measured the throughput of a concurrent counter implemented using MCS, CC-SYNCH, and SHM-SERVER on two single-socket x86 processors: a 10-core Intel Xeon E7-L8867 (without and with hyperthreading enabled) and a 6-core AMD Opteron 6176. In virtually all of the cases, peak throughput is significantly lower on the x86. For delegation, we have also measured the number of stalls per operation of the servicing thread (as in Figure 8(b)) and got proportionally larger numbers than on the TILE-Gx. Therefore, we believe that HYBLOCK, MP-SERVER, and HYBCOMB would outperform their shared-memory-only counterparts also on x86 hardware, if it provided native message passing support. Moreover, since there are more stalls on the x86, the potential performance improvement is higher.
Still, we did observe some platform-specific effects. Since the implementation of atomic instructions differs between the TILE-Gx and the x86, algorithms that use them intensively (typically nonblocking ones) may behave differently. This is visible in the case of LCRQ, which has substantially higher throughput on the x86 processors than on the TILE-Gx. In addition, because of the different memory consistency model, the two-lock MS queue outperforms its one-lock counterpart on the Xeon and the Opteron, in contrast to what we have observed on the TILE-Gx (see Figure 10(a)). Note, however, that these differences are specific to implementations of a certain concurrent object, a queue in this case. In other words, Figure 10(a) (showing queue performance) would look different on an x86, but the qualitative advantage of HYBLOCK, MP-SERVER, and HYBCOMB over the shared-memory constructions for mutual exclusion, which is central to this article, would in all likelihood remain the same.
Finally, the advantage provided by MP-SERVER and HYBCOMB is due to the way hardware message passing is implemented, and more specifically to the fact that receive operations read from a local buffer and that send operations are asynchronous. These features are not tied to a particular design, and so we believe that they can be easily provided by future implementations of hardware message passing. Note also that HYBCOMB depends heavily on the performance of the fetch-and-add instruction, since every client must execute it on the same variable before sending a request to the current combiner. Fetch-and-add on x86 processors is typically fast and scalable, at least in single-socket execution, as it is guaranteed to succeed [Morrison and Afek 2013].
ADDITIONAL CONSIDERATIONS
This section discusses some practical aspects of our message-passing approaches.
Oversubscribing and thread migration. The results presented in Section 5 assume a uniprogrammed environment, with at most one thread pinned to a core. This is not an inherent limitation of the hardware message passing approaches.
Indeed, on the TILE-Gx, oversubscribing is easily achieved because the hardware queue of each core can be multiplexed (see Section 5.1), which means that up to four threads can share a core and still each have an exclusive message queue. With HYBLOCK, MP-SERVER, and HYBCOMB, application threads can freely migrate to another core in between requests, as long as they are able to reserve a hardware queue on that core. Upon making a request, a thread t is only expected to have a valid identifier, corresponding to its current core and hardware queue. As long as t remains pinned to the current core while its request is pending, other threads will be able to reach it using that identifier.
More generally, any constraints related to thread migration and oversubscribing can be lifted with relatively straightforward support at the hardware and OS level. For instance, the asynchronous direct messages (ADM) mechanism [Sanchez et al. 2010] resembles Tilera's message-passing hardware in many ways, but in addition includes a small associative memory that caches (thread ID, core) pairs. This enables threads to migrate freely while the OS keeps track of thread-to-core mappings.
Deadlocks. Bearing in mind the limited capacity of the hardware message queues, another practical issue with message passing is the possibility of deadlocks, if messages back up in the network and block the sender. Obviously, there are no such problems with HYBLOCK, as a core's message queue contains at most two messages at any point during algorithm execution. Additionally, the message queues of MP-SERVER clients or HYBCOMB noncombiner threads cannot overflow, as they contain at most one message. Therefore, the servicing thread never blocks when sending a response to a request.
In our experiments, the message queue of a servicing thread cannot overflow, as it contains at most 35 three-word requests (105 words) at any time, which fits in the message queue. More generally, overflows can happen if the hardware queue is not large enough to hold one request per application thread. In this case, some clients could block when sending a request, but this is not an issue, since every such send is immediately followed by a blocking receive anyway.
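The argument above can be illustrated with a small software simulation. The following sketch is not the MP-SERVER code from the article: it uses Python's queue.Queue as a stand-in for the hardware message queues, and the function name run_counter_server and its parameters are hypothetical. The server's request queue is deliberately smaller than the number of clients, so some sends block, yet every request completes because each client holds at most one pending request and its one-slot reply queue is always free when the server responds.

```python
import queue
import threading


def run_counter_server(num_clients, ops_per_client, request_capacity):
    """Simulate server-based delegation over bounded queues (illustrative only)."""
    # Bounded inbox of the servicing thread (stand-in for its hardware queue).
    requests = queue.Queue(maxsize=request_capacity)
    # One single-slot reply queue per client.
    replies = [queue.Queue(maxsize=1) for _ in range(num_clients)]
    stop = object()          # sentinel that shuts the server down
    state = {"counter": 0}   # shared object, touched only by the server thread

    def server():
        while True:
            msg = requests.get()
            if msg is stop:
                return
            client_id, delta = msg
            state["counter"] += delta  # critical section runs on the server
            # Does not block: the client has a single pending request,
            # so its reply slot is empty at this point.
            replies[client_id].put(state["counter"])

    def client(client_id):
        for _ in range(ops_per_client):
            requests.put((client_id, 1))  # may block while the inbox is full
            replies[client_id].get()      # blocking receive of the response

    server_thread = threading.Thread(target=server)
    server_thread.start()
    clients = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    requests.put(stop)
    server_thread.join()
    return state["counter"]
```

For example, run_counter_server(8, 200, 2) runs eight clients against an inbox that holds only two requests. A client blocked in its send loses nothing, since it would otherwise be waiting in the receive that follows.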
RELATED WORK
In Section 3, we detailed locks and delegation as ways to implement mutual exclusion. This section gives an overview of work studying message passing in the manycore context.
Due to the uncertain future of cache coherence, a great body of recent work studies manycores provided with hardware message passing such as the Tilera [2014] and the Intel SCC [Howard et al. 2010]. It has been shown that message passing can help in achieving good performance in the implementation of transactional memory [Gramoli et al. 2012] and key-value stores [Berezecki et al. 2011]. In the 1990s, Herlihy et al. [1995] showed, by simulating MIT's Alewife processor, that message-passing implementations of counting networks and combining trees are more efficient than their shared-memory counterparts. In this article, we leverage message passing to efficiently implement an arbitrary concurrent object through universal constructions.
Some recent work also considers hardware augmentations for efficient mutual exclusion: token-based messaging over a dedicated network [Abellán et al. 2011] and a custom instruction set and dedicated cores [Suleman et al. 2010]. Our work complements these studies by considering an off-the-shelf processor with generic hardware support for message passing and providing synchronization completely in software.
Finally, similarly to RCL [Lozi et al. 2012], recent work implements message passing over shared memory in the context of concurrent objects [Metreveli et al. 2012; Calciu et al. 2013], because of the explicit control over communication and improved data locality it provides. Our results show that in this case, performance is still limited by the underlying cache coherence protocol, and that hardware message passing can provide a performance improvement.
CONCLUSION
Considering the problem of executing CSes, we studied how hardware message passing can be used for efficient thread synchronization. We proposed three generic constructions tailored to take advantage of hardware message passing: HYBLOCK, a hybrid lock; MP-SERVER, a server-based approach; and HYBCOMB, a hybrid combining algorithm. Experiments on Tilera's TILE-Gx processor show that HYBLOCK, MP-SERVER, and HYBCOMB largely outperform their shared-memory-only counterparts when used to implement ubiquitous linearizable concurrent objects (counters, queues, stacks).
Our results show that hardware message passing can provide more efficient thread synchronization and thus improve the scalability of concurrent code. The hybrid design of HYBLOCK and HYBCOMB demonstrates that processors providing both CC shared memory and message passing are appealing, as they allow us to take the best of both worlds. However, it also illustrates that significant algorithmic effort can be necessary to exploit the resources of a hybrid machine.
