Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly-scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.
INTRODUCTION
Partitioned Global Address Space (PGAS) and the wider class of Remote Memory Access (RMA) programming models enable high-performance communications that often outperform Message Passing [19, 34]. RMA utilizes remote direct memory access (RDMA) hardware features to access memories at remote processes without involving the OS or the remote CPU.
RDMA is offered by most modern HPC networks (InfiniBand, Myrinet, Cray's Gemini and Aries, IBM's Blue Gene, and PERCS) and many Ethernet interconnects that use the RoCE or iWARP protocols. RMA languages and libraries include Unified Parallel C (UPC), Fortran 2008 (formerly known as CAF), MPI-3 One Sided, Cray's SHMEM interface, or Open Fabrics (OFED). Thus, we observe that RMA is quickly emerging to be the programming model of choice for cluster systems, HPC computers, and large datacenters.
Fault tolerance of such systems is important because hardware and software faults are ubiquitous [38]. Two popular resilience schemes used in today's computing environments are coordinated checkpointing (CC) and uncoordinated checkpointing augmented with message logging (UC) [17]. In CC, applications regularly synchronize to save their state to memory, local disks, or a parallel file system (PFS) [38]; this data is used to restart after a crash. In UC, processes take checkpoints independently and use message logging to avoid rollbacks caused by the domino effect [37]. There has been considerable research on CC and UC for the message passing (MP) model [6, 17]. Still, no work addresses the exact design of these schemes for RMA-based systems.
In this work we develop a generic model for reasoning about resilience in RMA. Then, using this model, we show that CC and UC for RMA fundamentally differ from analogous schemes for MP. We also construct protocols that enable simple checkpointing and logging of remote memory accesses. We only use in-memory mechanisms to avoid costly I/O flushes and frequent disk and PFS failures [24, 38] . We then extend our model to cover two features of today's petascale and future exascale machines: (1) the growing complexity of hardware components and (2) decreasing amounts of memory per core. With this, our study fills an important knowledge gap between fault-tolerance and emerging RMA programming in large-scale computing systems.
In detail, we provide the following major contributions:
• We design a model for reasoning about the reliability of RMA systems running on flat and hierarchical hardware with limited memory per core. To our knowledge, this is the first work that addresses these issues.
• We construct schemes for in-memory checkpointing, logging, and recovering RMA-based applications.
• We unify these concepts in a topology-aware diskless protocol and we use real data and an analytic model to show that the protocol can endure concurrent hardware failures.
• We present the implementation of our protocol, analyze its performance, show it entails negligible overheads, and compare it to other schemes.

Table 1: Categorization of MPI One Sided/UPC/Fortran 2008 operations in our model. Some atomic functions are considered as both puts and gets. In UPC, the collectives, assignments and upc_memset/upc_memcpy behave similarly depending on the values of pointers to shared objects; the same applies to Fortran 2008. We omit MPI's post-start-complete-wait synchronization and request-based RMA operations for simplicity.
RMA PROGRAMMING
We now discuss concepts of RMA programming and present a formalization that covers existing RMA/PGAS models with strict or relaxed memory consistency (e.g., UPC or MPI-3 One Sided). In RMA, each process explicitly exposes an area of its local memory as shared. Memory can be shared in different ways (e.g., MPI windows, UPC shared arrays, or Co-Arrays in Fortran 2008); details are outside the scope of this work. Once shared, memory can be accessed with various language-specific operations.
RMA Operations
We identify two fundamental types of RMA operations: communication actions (often called accesses; they transfer data between processes), and synchronization actions (synchronize processes and guarantee memory consistency). A process p that issues an RMA action targeted at q is called the active source, and q is called the passive target. We assume p is active and q is passive (unless stated otherwise).
Communication Actions
We denote an action that transfers data from p to q and from q to p as put(p ⇒ → q) and get(p ⇐ → q), respectively. We use double-arrows to emphasize the asymmetry of the two operations: the upper arrow indicates the direction of data flow and the lower arrow indicates the direction of control flow. The upper part of Table 1 categorizes communication operations in various RMA languages. Some actions (e.g., atomic compare and swap) transfer data in both directions and thus fall into the family of puts and gets.
We also distinguish between puts that "blindly" replace a targeted memory region at q with a new value (e.g., UPC assignment), and puts that combine the data moved to q with the data that already resides at q (e.g., MPI_Accumulate). When necessary, we refer to the former type as the replacing put, and to the latter as the combining put.
Memory Synchronization Actions
We identify four major categories of memory synchronization actions: lock(p → q, str) (locks a structure str in q's memory to provide exclusive access), unlock(p → q, str) (unlocks str in q's memory and enforces consistency of str), flush(p → q, str) (enforces consistency of str in p's and q's memories), and gsync(p → ◇, str) (enforces consistency of str); ◇ indicates that the call targets all processes. Arrows indicate the flow of control (synchronization). When we refer to the whole process memory (and not a single structure), we omit str (e.g., lock(p → q)). The lower part of Table 1 categorizes synchronization calls in various RMA languages.
Epochs and Consistency Order
RMA's relaxed memory consistency enables non-blocking puts and gets. Issued operations are completed by memory consistency actions (flush, unlock, gsync). The period between any two such actions issued by p and targeting the same process q is called an epoch. Every unlock(p → q) or flush(p → q) closes p's current epoch and opens a new one (i.e., increments p's epoch number denoted as E(p → q)). p can be in several independent epochs related to each process that it communicates with. As gsync is a collective call, it increases epochs at every process.
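The epoch bookkeeping described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the class `EpochTracker` and its method names are hypothetical and do not correspond to any RMA library API, and a gsync is modeled as advancing the epoch toward every other process.

```python
from collections import defaultdict


class EpochTracker:
    """Per-process epoch counters E(p -> q), one per target q (hypothetical helper)."""

    def __init__(self, rank):
        self.rank = rank
        self.epoch = defaultdict(int)  # target rank -> current epoch number

    def flush(self, target):
        # flush(p -> q) closes p's current epoch toward q and opens a new one.
        self.epoch[target] += 1

    def unlock(self, target):
        # unlock(p -> q) closes the epoch toward q in the same way.
        self.epoch[target] += 1

    def gsync(self, all_ranks):
        # gsync is collective: here it advances the epoch toward every other process.
        for q in all_ranks:
            if q != self.rank:
                self.epoch[q] += 1


p = EpochTracker(rank=0)
p.flush(1)           # E(0 -> 1) becomes 1
p.gsync([0, 1, 2])   # E(0 -> 1) becomes 2, E(0 -> 2) becomes 1
print(dict(p.epoch))  # {1: 2, 2: 1}
```

Keeping one counter per target reflects the statement that p can be in several independent epochs, one for each process it communicates with.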
An important concept related to epochs is the consistency order (denoted as co →). 
Program, Synchronization, and Happened Before Orders
In addition to co → we require three more orders to specify an RMA execution [22] : The program order ( po →) specifies the order of actions of a single thread, similarly to the program order in Java [29] (x po → y means that x is called before y by some thread). The synchronization order ( so →) orders lock and unlock and other synchronizing operations. Happened-before (HB, hb →), a relation well-known in message passing [27] , is the transitive closure of the union of po → and so →. We abbreviate a consistent happen-before as
To state that actions are parallel in an order, we use the symbols po, so, hb . We show the orders in Fig. 2 ; more details can be found in [22] . 
Formal Model
We now combine the various RMA concepts and fault tolerance into a single formal model. We assume fail-stop faults (processes can disappear nondeterministically but behave correctly while being a part of the program). The data communication may happen out of order as specified for most RMA models. Communication channels between nonfailed processes are asynchronous, reliable, and error-free. The user code can only communicate and synchronize using RMA functions specified in Section 2.1. Finally, checkpoints and logs are stored in volatile memories.
We define a communication action a as a tuple a = ⟨type, src, trg, combine, EC, GC, SC, GNC, data⟩
where type is either a put or a get, src and trg specify the source and the target, and data is the data carried by a. Combine determines if a is a replacing put (combine = false) or a combining put (combine = true). EC (Epoch Counter) is the epoch number in which a was issued. GC, SC, and GNC are counters required for correct recovery; we discuss them in more detail in Section 4. We combine the notation from Section 2.1 with this definition and write put(p ⇒ → q).EC to refer to the epoch in which the put happens. We also define a determinant of a (denoted as #a, cf. [6]) to be tuple a without data: #a = ⟨type, src, trg, combine, EC, GC, SC, GNC⟩.
Similarly, a synchronization action b is defined as b = ⟨type, src, trg, EC, GC, SC, GNC, str⟩.
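As a concrete reading of these definitions, a communication action and its determinant can be represented as a small record type. The sketch below is illustrative only; the names `CommAction` and `determinant` are hypothetical, and the field types are assumptions (the model itself does not prescribe them).

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class CommAction:
    """A communication action a = <type, src, trg, combine, EC, GC, SC, GNC, data>."""
    type: str       # "put" or "get"
    src: int        # rank of the active source
    trg: int        # rank of the passive target
    combine: bool   # True for a combining put, False for a replacing put
    EC: int         # epoch counter
    GC: int         # counter used for recovery (see Section 4)
    SC: int         # counter used for recovery (see Section 4)
    GNC: int        # counter used for recovery (see Section 4)
    data: Optional[bytes] = None


def determinant(a):
    """Return #a: the tuple a without its data field, as a dictionary."""
    fields = asdict(a)
    del fields["data"]
    return fields


a = CommAction("put", src=0, trg=1, combine=False, EC=3, GC=0, SC=0, GNC=0, data=b"x")
print(determinant(a))
```

The determinant keeps everything needed to order the action during a replay while leaving out the payload, which is what allows a get to be recorded before its data becomes valid.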
Finally, a trace of an RMA program running on a distributed system can be written as the tuple
where P is the set of all Processes in D ( P = N ), E = A ∪ I is the set of all Events: A is the set of RMA Actions, I is the set of Internal actions (reads, writes, checkpoint actions). read(x, p) loads local variable x and write(x ∶= val, p) assigns val to x (in p's memory). C 
FAULT-TOLERANCE FOR RMA
We now present schemes that make RMA codes fault tolerant. We start with the simpler CC and then present the protocols for UC.
Coordinated Checkpointing (CC)
In many CC schemes, the user explicitly calls a function to take a checkpoint. Such protocols may leverage RMA's features (e.g., direct memory access) to improve the performance. However, these schemes have several drawbacks: they complicate the code because they can only be called when the network is quiet [21] and they do not always fit the optimality criteria such as Daly's checkpointing interval [15] . In this section, we first identify how CC in RMA differs from CC in MP and then describe a scheme for RMA codes that performs CC transparently to the application. We model a coordinated checkpoint as a set C = {C
, pm ≠ pn for any m, n.
RMA vs. MP: Coordinated Checkpointing
In MP, every C has to satisfy a consistency condition [21] :
This condition ensures that C does not reflect a system state in which one process received a message that was not sent by any other process. We adopt this condition and extend it to cover all RMA semantics:
We extend hb to cohb to guarantee that the system state saved in C does not contain a process affected by a memory access that was not issued by any other process. In RMA, unlike in MP, this condition can be easily satisfied because each process can drain the network with a local flush (enforcing consistency at any point is legal [22] 
Taking a Coordinated Checkpoint
We now propose two diskless schemes that obey the RMA-consistency condition and target MPI-3 RMA codes. The first ("Gsync") scheme can be used in programs that only synchronize with gsyncs. The other ("Locks") scheme targets codes that only synchronize with locks and unlocks. Note that in correct MPI-3 RMA programs gsyncs and locks/unlocks cannot be mixed [31]. All our schemes assume that a gsync may also introduce an additional hb → order, which is true in some implementations [31].
The "Gsync" Scheme Every process may take a coordinated checkpoint right after the user calls a gsync and before any further RMA calls by (1) optionally enforcing the global hb → order with an operation such as MPI_Barrier (denoted as bar), and (2) taking the checkpoint. Depending on the application needs, not every gsync has to be followed by a checkpoint. We use Daly's formula [15] to compute the best interval between such checkpoints and we take checkpoints after the right gsync calls.
Theorem 3.1. The Gsync scheme satisfies the RMA-consistency condition and does not deadlock.
Proof. We assume correct MPI-3 RMA programs represented by their trace D [22, 31] . For all p, q ∈ P, each gsync(p → ◇) has a matching gsync(q → ◇) such that [gsync(p → ◇) hb gsync(q → ◇)]. Thus, if every process calls bar right after gsync then bar matching is guaranteed and the program cannot deadlock. In addition, the gsync calls introduce a global consistency order co → such that the checkpoint is coordinated and consistent.
The "Locks" Scheme Every process p maintains a local Lock Counter LCp that starts with zero and is incremented after each lock and decremented after each unlock. When LCp = 0, process p can perform a checkpoint in three phases: (1) enforce consistency with a flush(p → ◇), (2) call a bar to provide the global hb → order, and (3) take a checkpoint C i p . The last phase, the actual checkpoint stage, is performed collectively; thus, all processes can take the checkpoint C in coordination.
Theorem 3.2. The Locks scheme satisfies the RMA-consistency condition and does not deadlock.
Proof. The call to flush(p → ◇) in phase 1 guarantees global consistency at each process. The bar in phase 2 guarantees that all processes are globally consistent before the checkpoint taken in phase 3.
It remains to prove deadlock-freedom. We assume correct MPI-3 RMA programs [22, 31]. A lock(p → q) can only block waiting for an active lock lock(z → q) and no bar can be started at z while the lock is held. In addition, for every lock(z → q), there is a matching unlock(z → q) in the execution such that lock(z → q) po → unlock(z → q) (for any z, p, q ∈ P). Thus, all locks must be released eventually, i.e., ∃a ∈ E ∶ a po → write(LCp ∶= 0, p) for any p ∈ P.
The above schemes show that transparent CC can be achieved much more simply in RMA than in MP. In MP, such protocols usually have to analyze inter-process dependencies due to sent/received messages, and add protocol-specific data to messages [11, 17], which reduces the bandwidth. In RMA this is not necessary.
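The three phases of the "Locks" scheme can be sketched as follows. This is a single-process illustration, not the authors' implementation: the class `LockTrackingProcess` is hypothetical, and the flush, barrier, and checkpoint steps are recorded as strings where a real library would issue the corresponding RMA calls.

```python
class LockTrackingProcess:
    """Tracks the Lock Counter LC_p and checkpoints only when it is zero."""

    def __init__(self, rank):
        self.rank = rank
        self.lc = 0       # Lock Counter LC_p, starts at zero
        self.trace = []   # stand-in for the calls a real library would issue

    def lock(self, target):
        self.lc += 1      # incremented after each lock

    def unlock(self, target):
        self.lc -= 1      # decremented after each unlock

    def try_checkpoint(self):
        if self.lc != 0:
            return False                   # a lock is still held: postpone
        self.trace.append("flush_all")     # phase 1: enforce consistency
        self.trace.append("barrier")       # phase 2: global hb order
        self.trace.append("checkpoint")    # phase 3: take the checkpoint
        return True


p = LockTrackingProcess(rank=0)
p.lock(1)
held = p.try_checkpoint()      # False: LC_p == 1
p.unlock(1)
released = p.try_checkpoint()  # True: LC_p == 0
print(held, released, p.trace)
```

Gating the checkpoint on LCp = 0 mirrors the argument in the proof: no barrier is started while a lock is held, so a process blocked on a lock never waits for a process that is itself blocked in the barrier.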
Uncoordinated Checkpointing (UC)
Uncoordinated checkpointing augmented with message logging reduces energy consumption and synchronization costs because a single process crash does not force all other processes to revert to the previous checkpoint and recompute [17, 37] . Instead, a failed process fetches its last checkpoint and replays messages logged beyond this checkpoint. However, UC schemes are usually more complex than CC [17] . We now analyze how UC in RMA differs from UC in MP, followed by a discussion of our UC protocols. Data structures for the protocols are shown in Table 2 .
RMA vs. MP: Uncoordinated Checkpointing
The first and obvious difference is that we now log not messages but accesses. Other differences are as follows:
Storing Access Logs In MP, processes exchange messages that always flow from the sender (process p) to the receiver (process q). Messages can be recorded at the sender's side [17, 37]. During a recovery, the restored process interacts with other processes to get and replay the logged messages (see Figure 3, part 1). In RMA, a put(p ⇒ → q) changes the state of q, but a get(p ⇐ → q) modifies the state of p. Thus, put(p ⇒ → q) can be logged in p's memory, but get(p ⇐ → q) cannot because a failure of p would prevent a successful recovery (see Figure 3, parts 2 and 3).
Transparency of Schemes In MP, both p and q actively participate in communication. In RMA, q is oblivious to accesses to its memory and thus any recovery or logging performed by p can be transparent to (i.e., does not obstruct) q (which is usually not the case in MP, cf. [37] ).
No Piggybacking Adding some protocol-specific data to messages (e.g., piggybacking) is a popular concept in MP [17] . Still, it cannot be used in RMA because puts and gets are accesses, not messages. Yet, issuing additional accesses is cheap in RMA.
Access Determinism Recent works in MP (e.g., [20] ) explore send determinism: the output of an application run is oblivious to the order of received messages. In our work we identify a similar concept in RMA that we call access determinism. For example, in race-free MPI-3 programs the application output does not depend on the order in which two accesses a and b committed to memory if a co b.
Orphan Processes In some MP schemes (called optimistic), senders postpone logging messages for performance reasons [17]. Assume q received a message m from p and then sent a message m ′ to r. If q crashes and m is not logged by p at that time, then q may follow a run in which it does not send m ′ . Thus, r becomes an orphan: its state depends on a message m ′ that was not sent [17] (see Figure 4, part 1).
In RMA, a process may also become an orphan. Consider Figure 4 (part 2). First, p modifies a variable x at q. Then, q reads x and conditionally issues a put(q ⇒ → r). If q crashes and p postponed logging put(p ⇒ → q), then q (while recovering) may follow a run in which it does not issue put(q ⇒ → r); thus r becomes an orphan. 
Taking an Uncoordinated Checkpoint
We denote the ith uncoordinated checkpoint taken by process p as C i p . Taking C i p is simple and entails: (1) locking local application data, (2) sending the copy of the data to some remote volatile storage, and (3) unlocking the application data (we defer the discussion on the implementation details until Section 6). After p takes C i p , any process q can delete the logs of every put(q ⇒ → p) (from LPq[p]) and
We demand that every C i p is taken immediately after closing/opening an epoch and before issuing any new communication operations (we call this the epoch condition). This condition is required because, if p issues a get(p ⇐ → q), the application data is guaranteed to be consistent only after closing the epoch. 
| Structure | Description |
|---|---|
| LPp[q] ∈ S | Logs of puts issued by p and targeted at q. |
| LGq[p] ∈ S | Logs of gets targeted at q and issued by p. |
| LPp ∈ S | Logs of puts issued and stored by p and targeted at any other process; LPp ≡ ⋃r∈P∧r≠p LPp[r]. |
| LGq ∈ S | Logs of gets targeted and stored at q, issued by any other process; LGq ≡ ⋃r∈P∧r≠q LGq[r]. |
| Qp ∈ S | A helper container stored at p, used to temporarily log #gets issued by p. |
| Nq[p] ∈ S | A structure (stored at q) that determines whether or not p issued a non-blocking get targeted at q. |

Table 2: Data structures for the UC protocols. LGq[p] and LGq are stored at q.
Transparent Logging of RMA Accesses
We now describe the logging of puts and gets.
Logging Puts To log a put(p ⇒ → q), p first calls lock(p → p, LPp). Self-locking is necessary because there may be other processes being recovered that may try to read LPp. Then, the put is logged (LPp[q] ∶= LPp[q] ∪ {put(p ⇒ → q)}; ":=" denotes the assignment of a new value to a variable or a structure). Finally, p unlocks LPp. Atomicity between logging and putting is not required because, in the weak consistency memory model, the source memory of the put operation may not be modified until the current epoch ends. If the program modifies it nevertheless, RMA implementations are allowed to return any value, thus the logged value is irrelevant. We log put(p ⇒ → q) before closing the epoch put(p ⇒ → q).EC. If the put is blocking then we log it before issuing, analogously to the pessimistic message logging [17] .
Logging Gets We log a get(p ⇐ → q) in two phases to retain its asynchronous behavior (see Algorithm 1). First, we record the determinant #get(p ⇐ → q) in Qp (lines 2-3). We cannot access get(p ⇐ → q).data as the local memory will only be valid after the epoch ends. We avoid issuing an additional blocking flush(p → q), instead we rely on the user's call to end the epoch. Second, when the user ends the epoch, we lock the remote log LGq, record get(p ⇐ → q), and unlock LGq (lines 4-7).
Note that if p fails between issuing get(p ⇐ → q) and closing the epoch, it will not be able to replay it consistently. To address this problem, p sets Nq[p] at process q to true right before issuing the first get(p ⇐ → q) (line 1), and to false after closing the epoch get(p ⇐ → q).EC (line 8). During the recovery, if p notices that any Nq[p] = true, it falls back to another resilience mechanism (i.e., the last coordinated checkpoint). If the get is blocking then we set Nq[p] = false after returning from the call.
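The logging steps for puts and gets can be illustrated with a small single-threaded simulation. It is a sketch under stated assumptions, not Algorithm 1: the class `Proc` and the functions `log_put`, `issue_get`, and `close_epoch` are hypothetical names, process memories are plain dictionaries, and the locks that protect LPp and LGq are omitted because nothing runs concurrently here.

```python
class Proc:
    """One simulated process with the logging structures of Table 2."""

    def __init__(self, rank):
        self.rank = rank
        self.memory = {}  # simulated shared memory: address -> value
        self.EC = {}      # epoch counter toward each target
        self.LP = {}      # LP_p[q]: puts issued by this process, stored locally
        self.LG = {}      # LG_q[p]: gets targeted at this process, stored here
        self.Q = []       # determinants of gets whose epoch is still open
        self.N = {}       # N_q[p]: True while p has a get in flight toward q


def log_put(p, q, addr, value, combine=False):
    # A put changes the state of q, so it can be logged at the source p.
    entry = {"trg": q.rank, "addr": addr, "data": value,
             "combine": combine, "EC": p.EC.get(q.rank, 0)}
    p.LP.setdefault(q.rank, []).append(entry)


def issue_get(p, q, addr):
    q.N[p.rank] = True  # mark the get as in flight before issuing it
    p.Q.append({"trg": q.rank, "addr": addr, "EC": p.EC.get(q.rank, 0)})
    p.memory[addr] = q.memory[addr]  # the transfer; valid only after epoch close


def close_epoch(p, q):
    # Phase 2: the data is now valid, so record the full get in the remote log.
    pending = [d for d in p.Q if d["trg"] == q.rank]
    p.Q = [d for d in p.Q if d["trg"] != q.rank]
    for d in pending:
        q.LG.setdefault(p.rank, []).append(dict(d, data=p.memory[d["addr"]]))
    q.N[p.rank] = False
    p.EC[q.rank] = p.EC.get(q.rank, 0) + 1


p, q = Proc(0), Proc(1)
q.memory["x"] = 42
log_put(p, q, "y", 7)
issue_get(p, q, "x")
in_flight = q.N[0]   # True until the epoch is closed
close_epoch(p, q)
print(in_flight, q.N[0], q.LG[0], p.LP[1])
```

The simulation makes the asymmetry visible: the put log lives at the source, while the get log lives at the target, so either log survives a crash of the process whose state it describes.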
CAUSAL RECOVERY FOR UC
We now show how to causally recover a failed process (causally means preserving co →, so →, and hb →). This section describes technical details on how to guarantee all orders to ensure a correct access replay. If the reader is not interested in all details, she may proceed to Section 5 without disrupting the flow. A causal process recovery has three phases: (1) fetching uncoordinated checkpoint data, (2) replaying accesses from remote logs, and (3) in case of a problem during the replay, falling back to the last coordinated checkpoint. We first show how we log the respective orderings between accesses (Section 4.1) and how we prevent replaying some accesses twice (Section 4.2). We finish with our recovery scheme (Section 4.3) and a discussion (Section 4.4). Due to space constraints, we include full proofs in the technical report version of the paper.
Logging Order Information
We now show how to record so →, hb →, and co →. For clarity, but without loss of generality, we separately present several scenarios that exhaust possible communication/synchronization patterns in our model. We consider three processes (p, q, r) and we analyze what data is required to replay q. We show each pattern in Figure 5 .
A. Puts and Flushes First, p and r issue puts and flushes at q. At both p and r, puts separated by flushes are ordered with co →. This order is preserved by recording epoch counters (.EC) with every logged put(p ⇒ → q). Note that, however, RMA semantics do not order calls issued by p and r: [put(p ⇒ → q) co put(r ⇒ → q)] without additional process synchronization. Here, we assume access determinism: the recovery output does not depend on the order in which such puts committed in q's memory.
B. Gets and Flushes Next, q issues gets and flushes targeted at p and r. Again, co → has to be logged. However, this time gets targeted at different processes are ordered (because they are issued by the same process). To log this ordering, q maintains a local Get Counter GCq that is incremented each time q issues a flush(q → ◇) to any other process. The value of this counter is logged with each get using the field .GC (cf. Section 2.4).
C. Puts and Locks In this scenario p and r issue puts at q and synchronize their accesses with locks and unlocks. This pattern requires logging the so → order. We achieve this with a Synchronization Counter SCq stored at q. After issuing a lock(p → q), p (the same refers to r) fetches the value of SCq, increments it, updates remote SCq, and records it with every put using the field .SC (cf. Section 2.4). In addition, this scenario requires recording co → that we solve with .EC, analogously as in the "Puts and Flushes" pattern.
D 
Preventing Replaying Accesses Twice
Assume that process p issues a put(p ⇒ → q) (immediately logged by p in LPp[q]) such that put(p ⇒ → q) co → C j q . It means that the state of q recorded in checkpoint C j q is affected by put(p ⇒ → q). Now assume that q fails and begins to replay the logs. If p did not delete the log of put(p ⇒ → q) from LPp[q] (it was allowed to do it after q took C j q ), then q replays put(p ⇒ → q) and this put affects its memory for the second time. This is not a problem if put(p ⇒ → q).combine = f alse, because such a put always overwrites the memory region with the same value. However, if put(p ⇒ → q).combine = true, then q ends up in an inconsistent state (e.g., if this put increments a memory cell, this cell will be incremented twice).
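The difference between replaying a replacing put and a combining put can be seen in a few lines. The sketch below is illustrative; the helper `apply_put` and the dictionary-based memory are assumptions made for the example.

```python
def apply_put(memory, put):
    """Apply one logged put to a simulated memory."""
    if put["combine"]:
        memory[put["addr"]] += put["data"]  # combining put (e.g., an increment)
    else:
        memory[put["addr"]] = put["data"]   # replacing put (overwrite)


log = [
    {"addr": "a", "data": 7, "combine": False},
    {"addr": "b", "data": 1, "combine": True},
]

memory = {"a": 0, "b": 0}
for put in log:
    apply_put(memory, put)
checkpoint = dict(memory)   # the checkpoint already reflects both puts

restored = dict(checkpoint)
for put in log:             # the logs were not trimmed, so both puts are replayed
    apply_put(restored, put)

print(checkpoint)  # {'a': 7, 'b': 1}
print(restored)    # {'a': 7, 'b': 2}
```

Cell a is unchanged by the second application, whereas cell b is incremented twice, which is the inconsistency described above.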
To solve this problem, every process p maintains a local structure Mp[q] ∈ S. When p issues and logs a put(p ⇒ → q) such that put(p ⇒ → q).combine = true, it sets
Recovering a Failed Process
We now describe a protocol for codes that synchronize with gsyncs; consult the technical report for other schemes. Let us denote the failed process as p f . We assume an underlying batch system that provides a new process pnew in the place of p f , and that other processes resume their communication with pnew after it fully recovers. We illustrate the scheme in Algorithm 2. First, pnew fetches the checkpointed data. Second, pnew gets the logs of puts (put logs) and gets (get logs) related to p f (lines 3-11). It also checks if any Nq[p f ] = true; if yes, it instructs all processes to roll back to the last coordinated checkpoint. The protocol uses locks (lines 5, 10) to prevent data races due to, e.g., concurrent recoveries and log cleanups by q.
The main part (lines 12-26) replays accesses causally. The recovery ends when there are no logs left (line 12; |logs| is the size of the set "logs"). We first get the logs with the smallest .GNC (line 13) to maintain cohb → introduced by gsyncs (see § 4.1 E). Then, within this step, we find the logs with minimum .EC and .GC to preserve co → in issued puts and gets, respectively (lines 17-18, see § 4.1 A, B). We replay them in lines 19-20.
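The replay ordering described above can be sketched as a loop over the fetched logs. This is not Algorithm 2 but a simplified illustration: the function `replay_order` is hypothetical, logs are plain dictionaries, and replaying puts before gets within one step is an arbitrary choice made for the example.

```python
def replay_order(put_logs, get_logs):
    """Order logged accesses by smallest GNC, then by minimum EC (puts) and GC (gets)."""
    logs = ([dict(l, kind="put") for l in put_logs] +
            [dict(l, kind="get") for l in get_logs])
    order = []
    while len(logs) > 0:                        # recovery ends when no logs are left
        gnc = min(l["GNC"] for l in logs)       # smallest .GNC first
        step = [l for l in logs if l["GNC"] == gnc]
        puts = [l for l in step if l["kind"] == "put"]
        gets = [l for l in step if l["kind"] == "get"]
        batch = []
        if puts:
            ec = min(l["EC"] for l in puts)     # oldest epoch among the puts
            batch += [l for l in puts if l["EC"] == ec]
        if gets:
            gc = min(l["GC"] for l in gets)     # oldest get counter among the gets
            batch += [l for l in gets if l["GC"] == gc]
        order.extend(batch)                     # "replay" this batch
        done = {id(l) for l in batch}
        logs = [l for l in logs if id(l) not in done]
    return order


puts = [
    {"name": "put_b", "GNC": 0, "EC": 1},
    {"name": "put_a", "GNC": 0, "EC": 0},
    {"name": "put_c", "GNC": 1, "EC": 2},
]
gets = [{"name": "get_a", "GNC": 0, "GC": 0}]
names = [l["name"] for l in replay_order(puts, gets)]
print(names)  # ['put_a', 'get_a', 'put_b', 'put_c']
```

Because all logs are fetched before the loop starts, the sketch also reflects the memory cost of the scheme: the recovering process must hold every put log and get log at once.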
Discussion
Our recovery scheme presents a trade-off between memory efficiency and time to recover. Process pnew fetches all related logs and only then begins to replay accesses. Thus, we assume that its memory has capacity to contain put logs and get logs; a reasonable assumption if the user program has regular communication patterns (true for most of today's RMA applications [19] ). A more memory-efficient scheme fetches logs while recovering. This incurs performance issues as pnew has to access remote logs multiple times.
EXTENDING THE MODEL FOR MORE RESILIENCE
Our model and in-memory resilience schemes are oblivious to the underlying hardware. However, virtually all of today's systems have a hierarchical hardware layout (e.g., cores reside on a single chip, chips reside in a single node, nodes form a rack, and racks form a cabinet). Multiple elements may be affected by a single failure at a higher level, jeopardizing the safety of our protocols. We now extend our model to cover arbitrary hierarchies and propose topology-aware mechanisms to make our schemes handle concurrent hardware failures. Specifically, we propose the following three extensions:
The Hierarchy of Failure Domains A failure domain (FD) is an element of a hardware hierarchy that can fail (e.g., a node or a cabinet). FDs constitute an FD hierarchy (FDH) with h levels. An example FDH is shown in Figure 6, h = 4. We skip the level of single cores because in practice the smallest FD is a node (e.g., in the TSUBAME2.0 system failure history, there are no core failures [3]). Then, we define $H = \bigcup_{1 \le j \le h} \bigcup_{1 \le i \le H_j} H_{i,j}$ to be the set of all the FD elements in an FDH. Hi,j and Hj are element i of hierarchy level j and the number of such elements at level j, respectively. For example, in Figure 6 H3,2 is the third blade (level 2) and H2 = 96.
Groups of Processes
To improve resilience, we split the process set P into g equally-sized groups Gi and add m checksum processes to each group to store checksums of checkpoints taken in each group (using, e.g., the Reed-Solomon [36] coding scheme). Thus, every group can resist m concurrent process crashes. The group size is $|G| = \frac{|P|}{g} + m$. New System Definition We now extend the definition of a distributed system D to cover the additional concepts:
G = {G1, ..., Gg} is a set of Groups of processes and M ∶ P × N → H is a function that Maps process p to the FD at hierarchy level k where p runs: M (p, k) = H j,k . M defines how processes are distributed over FDH. For example, if p runs on blade H1,2 from Figure 6, then M (p, 2) = H1,2.
Handling Multiple Hardware Failures
More than m process crashes in any group Gi result in a catastrophic failure (CF; we use the name from [8]) that incurs restarting the whole computation. Depending on how M distributes processes, such a CF may be caused by several (or even one) crashed FDs. To minimize the risk of CFs, M has to be topology-aware (t-aware): for a given level n (called a t-awareness level), no more than m processes from the same group can run on the same H i,k at any level k, k ≤ n. Figure 7 shows an example t-aware process distribution.
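The t-awareness condition can be checked mechanically for a given mapping. The following sketch is only an illustration: the function `is_t_aware` and its dictionary-based inputs are hypothetical, with `group_of` assigning each process to a group and `mapping` playing the role of M.

```python
from collections import Counter


def is_t_aware(group_of, mapping, n, m):
    """Check that at every level k <= n no FD hosts more than m processes of one group.

    group_of: process id -> group id
    mapping:  (process id, level k) -> FD element id, i.e. M(p, k)
    """
    for k in range(1, n + 1):
        load = Counter((group_of[p], mapping[(p, k)]) for p in group_of)
        if any(count > m for count in load.values()):
            return False
    return True


group_of = {0: "G1", 1: "G1", 2: "G2", 3: "G2"}

# Each node hosts one process from each group: a node crash costs a group one member.
spread = {(0, 1): "nodeA", (1, 1): "nodeB", (2, 1): "nodeA", (3, 1): "nodeB"}
# Both members of G1 share nodeA: one node crash takes out two members of G1.
packed = {(0, 1): "nodeA", (1, 1): "nodeA", (2, 1): "nodeB", (3, 1): "nodeB"}

print(is_t_aware(group_of, spread, n=1, m=1))  # True
print(is_t_aware(group_of, packed, n=1, m=1))  # False
```

With m = 1, the second mapping lets a single node failure remove two processes of the same group, which is exactly the situation the t-awareness condition rules out.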
Calculating Probability of a CF
We now calculate the probability of a catastrophic failure ($P_{cf}$) in our model. We later (§ 7.1) use $P_{cf}$ to show that our protocols are resilient on a concrete machine (the TSUBAME2.0 supercomputer [3]). If a reader is not interested in the derivation details, she may proceed to Section 6 where we present the results. We set m = 1 and thus use the XOR erasure code, similar to an additional disk in a RAID5 [12]. We assume that failures at different hierarchy levels are independent and that any number $x_j$ of elements from any hierarchy level j ($1 \le x_j \le H_j$, $1 \le j \le h$) can fail. Thus,

$$P_{cf} = \sum_{j=1}^{h} \sum_{x_j=1}^{H_j} P(x_j \cap x_{j,cf}) = \sum_{j=1}^{h} \sum_{x_j=1}^{H_j} P_j(x_j)\, P_j(x_{j,cf} \mid x_j). \quad (7)$$

$P(x_j \cap x_{j,cf})$ is the probability that $x_j$ elements of hierarchy level j will fail and result in a catastrophic failure. $P_j(x_j)$ is the probability of the failure of $x_j$ elements from level j of the hierarchy. $P_j(x_{j,cf} \mid x_j)$ is the probability that $x_j$ given concurrent failures at hierarchy level j are catastrophic to the system. It is difficult to analytically derive $P_j(x_j)$ as it is specific for every machine. For our example study (see Section 7.1) we use the failure rates from the TSUBAME2.0 failure history [3].
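In contrast, $P_j(x_{j,cf} \mid x_j)$ can be calculated using combinatorial theory. Assume that M distributes processes in a t-aware way at levels 1 to n of the FDH ($1 \le n \le h$). First, we derive $P_j(x_{j,cf} \mid x_j)$ for any level j such that $1 \le j \le n$:

$$P_j(x_{j,cf} \mid x_j) = \frac{D_j \binom{|G|}{2} \binom{H_j - 2}{x_j - 2}}{\binom{H_j}{x_j}}.$$

$\binom{|G|}{2}$ is the number of the possible catastrophic failure scenarios in a single group (m = 1, thus any two process crashes in one group are catastrophic). $D_j$ is the number of such single-group scenarios at the whole level j and is equal to $\lceil H_j / |G| \rceil$ (see Figure 8 for an intuitive explanation). $\binom{H_j - 2}{x_j - 2}$ is the number of the remaining possible failure scenarios and $\binom{H_j}{x_j}$ is the total number of the possible failure scenarios. Second, for the remaining levels j ($n + 1 \le j \le h$), M is not t-aware and thus in the worst-case scenario any element crash is catastrophic: $P_j(x_{j,cf} \mid x_j) = 1$. The final formula for $P_{cf}$ is thus

$$P_{cf} = \sum_{j=1}^{n} \sum_{x_j=1}^{H_j} P_j(x_j)\, \frac{D_j \binom{|G|}{2} \binom{H_j - 2}{x_j - 2}}{\binom{H_j}{x_j}} + \sum_{j=n+1}^{h} \sum_{x_j=1}^{H_j} P_j(x_j).$$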
HOLISTIC RESILIENCE PROTOCOL
Figure 8 (caption, continued): 2) Consider three process distribution scenarios by M (each is t-aware). Optimistically, processes can be distributed contiguously (scenario A) or partially fragmented (scenario B). To get the upper bound for $P_{cf}$ we use the worst-case pattern (scenario C). Now, to get the number of single-group CF scenarios at the whole level $j$ ($D_j$), we need to obtain the number of the groups of hardware elements at $j$ that hold process groups: $\lceil H_j / |G| \rceil$.

We now describe an example conceptual implementation of holistic fault tolerance for RMA that we developed to understand the tradeoffs between resilience and performance in RMA-based systems. We implement it as a portable library (based on C and MPI) called ftRMA. We utilize MPI-3's one-sided interface, but any other RMA model enabling relaxed memory consistency could be used instead (e.g., UPC or Fortran 2008). We use the publicly available foMPI implementation of MPI-3 one-sided communication as the MPI library [1], but any other MPI-3 compliant library would be suitable. For simplicity we assume that the user application uses one contiguous region of shared memory of the same size at each process. Still, all the conclusions drawn are valid for any other application pattern based on RMA. Following the MPI-3 specification, we call this shared region of memory at every process a window. Finally, we divide user processes (referred to as CoMputing processes, CMs) into groups (as described in Section 5) and add one CHecksum process (denoted as CH) per group (m = 1). For any computing process p, we denote the CH in its group as CH(p). CHs store and update XOR checksums of their CMs.
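To illustrate why one XOR checksum per group (m = 1) tolerates a single crash in that group, here is a minimal Python sketch. It is not ftRMA code: windows are modeled as equally sized byte strings, and the function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equally sized buffers."""
    if len(a) != len(b):
        raise ValueError("windows must have the same size")
    return bytes(x ^ y for x, y in zip(a, b))


def group_checksum(windows):
    """Checksum that a CH would hold for the windows of its CMs."""
    return reduce(xor_bytes, windows)


def recover_window(checksum, surviving_windows):
    """Rebuild the single lost window from the checksum and the survivors."""
    return reduce(xor_bytes, surviving_windows, checksum)
```

XOR-ing the checksum with all surviving windows cancels their contributions and leaves the lost window. Two crashes in the same group cannot be undone this way, which matches the m = 1 assumption in the resilience model.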
Protocol Overview
In this section we provide a general overview of the layered protocol implementation (see Figure 9). The first layer logs accesses. The second layer takes uncoordinated checkpoints (called demand checkpoints) to trim the logs. Layer 3 performs regular coordinated checkpoints. All layers are diskless. Causal recovery replays memory accesses. Finally, our FDH increases the resilience of the whole protocol.

Daly's Interval
Layer 3 uses Daly's formula [15] as the optimum interval between coordinated checkpoints. In this formula, M is the MTBF (the mean time between failures that ftRMA handles with coordinated checkpointing) for the target machine and δ is the time to take a checkpoint. The user provides M while δ is estimated by our protocol.
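As an illustration of how such an interval can be computed from M and δ, the sketch below uses Daly's higher-order estimate as it is commonly stated. Whether ftRMA uses exactly this variant is an assumption, and the function name is hypothetical.

```python
from math import sqrt


def daly_interval(delta, M):
    """Checkpoint interval from Daly's higher-order estimate.

    delta: time to take one checkpoint
    M:     mean time between failures (same time unit as delta)
    """
    if delta <= 0 or M <= 0:
        raise ValueError("delta and M must be positive")
    if delta >= 2 * M:
        # Checkpoints are so expensive that the estimate falls back to M.
        return M
    r = delta / (2 * M)
    return sqrt(2 * delta * M) * (1 + sqrt(r) / 3 + r / 9) - delta
```

With δ = 2 s and M = 100 s, the estimate gives an interval of roughly 18.7 s. The simpler first-order form sqrt(2δM) − δ yields 18 s for the same inputs.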
Interfacing with User Programs and Runtime
ftRMA routines are called after each RMA action. This would entail runtime system calls in compiled languages; in our implementation we use the PMPI profiling interface [31]. During window creation the user can specify: (1) the number of CHs, (2) the MTBF, and (3) whether to use topology-awareness. After window creation, the protocol divides processes into CMs and CHs. If the user enables t-awareness, groups of processes running on the same FDs are also created. In the current version, ftRMA takes computing nodes into account when applying t-awareness.
Demand Checkpointing
Demand checkpoints address the problem of diminishing amounts of memory per core in today's and future computing centers. If free memory at CM process p is scarce, p selects the process q with the largest LP_p[q] or LG_p[q] and requests a demand checkpoint. First, p sends a checkpoint request to CH(q) which, in turn, forces q to checkpoint. This is done by closing all the epochs, locking all the relevant data structures, calculating the XOR checksum, and either (1) streaming the result to CH(q) piece by piece or (2) sending the result in one bulk transfer. CH(q) integrates the received checkpoint data into the existing XOR checksum. Variant (1) is memory-efficient, while variant (2) is less time-consuming. Next, q unlocks all the data structures. Finally, CH(q) sends a confirmation with the epoch number E(p → q) and the respective counters (GNC_q, GC_q, SC_q) to p. Process p can then delete the logs of actions a where a.EC < E(p → q), a.GNC < GNC_q, a.GC < GC_q, and a.SC < SC_q.
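The two local decisions that p makes in this scheme, choosing q and trimming its logs after the confirmation, can be sketched as follows. This is a hedged Python illustration rather than ftRMA code. It assumes that the log sizes are kept as dictionaries keyed by process rank and that the four conditions on a logged action are combined conjunctively. All names apart from the counters EC, GNC, GC, and SC are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedAction:
    EC: int   # epoch counter recorded with the action
    GNC: int  # counters named as in the text
    GC: int
    SC: int


def select_checkpoint_target(LP_sizes, LG_sizes):
    """Pick the process q whose put or get log at p is the largest."""
    ranks = set(LP_sizes) | set(LG_sizes)
    return max(ranks, key=lambda q: max(LP_sizes.get(q, 0), LG_sizes.get(q, 0)))


def trim_log(log, epoch, gnc_q, gc_q, sc_q):
    """Keep only the actions not yet covered by q's demand checkpoint."""
    def covered(a):
        # Assumption of this sketch: all four conditions must hold.
        return a.EC < epoch and a.GNC < gnc_q and a.GC < gc_q and a.SC < sc_q
    return [a for a in log if not covered(a)]
```

A caller would first pick q with `select_checkpoint_target`, wait for the confirmation from CH(q), and then pass the received epoch number and counters to `trim_log`.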
TESTING AND EVALUATION
In this section we first analyze the resilience of our protocol using real data from the TSUBAME2.0 [3] failure history. Then, we test the performance of ftRMA with a NAS benchmark [14] that computes a 3D Fast Fourier Transformation and with a distributed key-value store. We denote the number of CHs and CMs as |CH| and |CM|, respectively.
Analysis of Protocol Resilience
Our protocol stores all data in volatile memories to avoid I/O performance penalties and frequent disk and parallel file system failures [24, 38]. This raises the question of whether the scheme is resilient in practical environments. To answer this question, we calculate the probability of a catastrophic failure $P_{cf}$ of our protocol (using Equations (7) and (9)), applying t-awareness at different levels of the FDH.
We first fix the model parameters ($H_j$, $h$) to reflect the hierarchy of TSUBAME2.0. The TSUBAME2.0 FDH has four levels [38]: nodes, power supply units (PSUs), edge switches, and racks ($h = 4$). Then, to get $P_{cf}$, we calculate the distributions $P_j(x_j)$ that determine the probability of $x_j$ concurrent crashes at level $j$ of the TSUBAME FDH. To obtain $P_j(x_j)$ we analyzed 1962 crashes in the history of TSUBAME2.0 failures [3]. Following [8] we decided to use exponential probability distributions, where the argument is the number of concurrent failures $x_j$. We derived four probability density functions (PDFs) that approximate the failure distributions of nodes ($0.30142 \cdot 10^{-2} e^{-1.3567 x_1}$), PSUs ($1.1836 \cdot 10^{-4} e^{-1.4831 x_2}$), switches ($3.9249 \cdot 10^{-5} e^{-1.5902 x_3}$), and racks ($3.2257 \cdot 10^{-5} e^{-1.5488 x_4}$). The unit is failures per day. Figures 10a and 10b illustrate two PDF plots with histograms. The distributions for PSUs, switches, and racks are based on real data only. For nodes it was not always possible to determine the exact correlation of failures. Thus, we pessimistically assumed (based on [8]) that single crashes constitute 75% of all node failures, two concurrent crashes constitute 20%, and other values decrease exponentially.

Figure 10c shows the resilience of our protocol when using five t-awareness strategies. The number of processes N is 4,000. $P_{cf}$ is normalized to a one-day period. Without t-awareness (no-topo) a single crash of any FD of TSUBAME2.0 is catastrophic, thus $P_{cf}$ does not depend on |CH|. In the other scenarios every process from every group runs on a different node (nodes), PSU (PSUs), switch enclosure (switches), or rack (racks). In all cases $P_{cf}$ decreases proportionally to the increasing |CH|; however, at some point the exponential distributions ($P_j(x_j)$) begin to dominate the results. Topology-awareness at higher hierarchy levels significantly improves the resilience of our protocol. For example, if |CH| = 5% N, $P_{cf}$ in the switches scenario is ≈4 times lower than in nodes. Furthermore, all t-aware schemes are 1-3 orders of magnitude more resilient than no-topo.
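The four fitted distributions can be evaluated directly. The Python sketch below only transcribes the coefficients quoted above. The table layout and the function name are illustrative assumptions, and reading the node coefficient as 0.30142 · 10^-2 follows the text as extracted.

```python
from math import exp

# (scale, rate) pairs of the exponential fits; unit: failures per day.
FITS = {
    "nodes": (0.30142e-2, 1.3567),
    "PSUs": (1.1836e-4, 1.4831),
    "switches": (3.9249e-5, 1.5902),
    "racks": (3.2257e-5, 1.5488),
}


def failure_density(level, x):
    """Fitted value for x concurrent failures at the given FDH level."""
    scale, rate = FITS[level]
    return scale * exp(-rate * x)
```

Under these fits each additional concurrent node failure lowers the fitted value by a factor of e^1.3567, that is, by roughly 3.9.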
Comparison of Resilience
The results show that even a simple scheme (nodes) significantly improves the resilience of our protocol that performs only in-memory checkpointing and logging. We conclude that costly I/O flushes to the parallel file system (PFS) are not required for obtaining a high level of resilience. On the contrary, such flushes may even increase the risk of failures. They usually entail stressing the I/O system for significant amounts of time [38] , and stable storage is often the element most susceptible to crashes. For example, a Blue Gene/P supercomputer had 4,164 disk fail events in 2011 (for 10,400 total disks) [24] , and its PFS failed 77 times, almost two times more often than other hardware [24] .
Analysis of Protocol Performance
We now discuss the performance of our fault tolerance protocol after the integration with two applications: NAS 3D FFT and a distributed key-value store. Both of these applications are characterized by intensive communication patterns, thus they demonstrate worst-case scenarios for our protocol. Integrating ftRMA with the application code was trivial and required minimal code changes resulting in the same code complexity.
Comparison to Scalable Checkpoint/Restart
We compare ftRMA to Scalable Checkpoint/Restart (SCR) [2], a popular open-source message passing library that provides checkpoint and restart capability for MPI codes but does not enable logging. We turn on the XOR scheme in SCR and we fix the size of SCR groups [2] so that they match the analogous parameter in ftRMA (|G|). To make the comparison fair, we configure SCR to save checkpoints both to in-memory tmpfs (SCR-RAM) and to the PFS (SCR-PFS).

Comparison to Message Logging
To compare the logging overheads in MP and RMA we also developed a simple message logging (ML) scheme (based on the protocol from [37]) that records accesses. Similarly to [37] we use additional processes to store protocol-specific access logs; the data is stored at the sender's or receiver's side depending on the type of operation.

We execute all benchmarks on the Monte Rosa system and we use Cray XE6 computing nodes. Each node contains four 8-core 2.3 GHz AMD Opteron 6276 (Interlagos) processors and is connected to a 3D-Torus Gemini network. We use the Cray Programming Environment 4.1.46 to compile the code.
NAS 3D Fast Fourier Transformation
Our version of the NAS 3D FFT [14] benchmark is based on MPI-3 nonblocking puts (we exploit the overlap of computation and communication). The benchmark calculates 3D FFT using a 2D decomposition.
Performance of Coordinated Checkpointing
We begin by evaluating our "Gsync" checkpointing scheme. Figure 10d illustrates the performance of NAS FFT fault-free runs. We compare the original application code without any fault tolerance (no-FT), ftRMA, SCR-RAM, and SCR-PFS. We fix |CH| = 12.5% |CM|. We include two ftRMA scenarios: f-daly (Daly's formula determines the interval between coordinated checkpoints) and f-no-daly (fixed frequency of checkpoints without Daly's formula, ≈2.7s for 1024 processes). We use the same t-awareness policy in all codes (nodes). The tested schemes have the following fault-tolerance overheads over the baseline no-FT: 1-5% (f-daly), 1-15% (f-no-daly), 21-37% (SCR-RAM), and 46-67% (SCR-PFS). SCR-RAM performs worse than f-daly and f-no-daly because ftRMA is based on the Gsync scheme, which incurs less synchronization. SCR-PFS entails the highest overheads due to costly I/O flushes.
Performance of Demand Checkpointing
We now analyze how the size of the log impacts the number of demand checkpoints and the performance of fault-free runs (see Figure 11a). Dedicating less than 44 MiB of memory per process for storing logs triggers demand checkpoint requests to clear the log; checkpoints are taken every ≈0.25s on average when the size of the log is 36 MiB. This results in performance penalties but leaves more memory available to the user.
Performance of Access Logging
As the next step we evaluate our logging scheme. Figure 11b illustrates the performance of fault-free runs. We compare no-FT, ftRMA, and our ML protocol (ML). ftRMA adds only ≈8-9% of overhead to the baseline (no-FT) and consistently outperforms ML by ≈9% due to the smaller amount of protocol-specific interaction between processes.

Varying |CH| and T-Awareness Policies
Here, we analyze how |CH| and t-awareness impact the performance of NAS FFT fault-free runs. We set |CH| = 12.5% |CM| and |CH| = 6.25% |CM|, and we use the no-topo and nodes t-awareness policies. The results show that all these schemes differ only negligibly from no-FT, by 1-5%.
Key-Value Store
Our key-value store is based on a simple distributed hash table (DHT) that stores 8-byte integers. The DHT consists of parts called local volumes, constructed with fixed-size arrays. Every local volume is managed by a different process. Inserts are based on the MPI-3 atomic Compare-And-Swap and Fetch-And-Op functions. Elements that collide on a hash are inserted into the overflow heap that is part of each local volume. To insert an element, a thread atomically updates the pointers to the next free cell and to the last element in the local volume. Memory consistency is ensured with flushes. One get and one put are logged if there is no hash collision; otherwise 6 puts and 4 gets are recorded.
Performance of Access Logging
We now measure the relative performance penalty of logging puts and gets. During the benchmark, processes insert random elements with random keys. We focus on inserts only as they are fully representative for the logging evaluation. To simulate realistic requests, every process waits for a random time after every insert. The function that we use to calculate this interval is based on the exponential probability distribution: $f \delta e^{-\delta x}$, where $f$ is a scaling factor, $\delta$ is a rate parameter, and $x \in [0; b)$ is a random number. The selected parameter values ensure that every process spends ≈5-10% of the total runtime on inserting elements. For many computation-intensive applications this is already a high amount of communication.
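The waiting interval described above can be sketched in a few lines of Python. This is an illustration under assumptions: x is drawn uniformly from [0, b), and the parameter values in the usage note are placeholders, since the text does not state the values used in the benchmark.

```python
import math
import random


def wait_time(f, delta, b, rng=random):
    """Waiting interval f * delta * exp(-delta * x) for a random x in [0, b).

    f:     scaling factor
    delta: rate parameter of the exponential function
    b:     upper bound of the random argument x
    rng:   source of randomness (the random module or a random.Random)
    """
    x = rng.random() * b  # assumption: x is uniform on [0, b)
    return f * delta * math.exp(-delta * x)
```

For example, `wait_time(2.0, 0.5, 4.0, random.Random(0))` returns a value between 2.0 · 0.5 · e^-2 and 2.0 · 0.5, so f and δ together bound the longest pause after an insert.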
We again compare no-FT, ML, and two ftRMA scenarios: f-puts (logging only puts) and f-puts-gets (logging puts and gets). We fix |CH| = 12.5% |CM| and use the nodes t-awareness. We skip SCR as it does not enable logging. We present the results in Figure 11c. For N = 256, the logging overhead over the baseline (no-FT) is ≈12% (f-puts), 33% (f-puts-gets), and 40% (ML). The overhead of logging puts is due to the fact that every operation is recorded directly after it is issued. Traditional message passing protocols suffer from a similar effect [17]. The overhead generated by logging gets in f-puts-gets and ML is more significant because, due to RMA's one-sided semantics, every get has to be recorded remotely. In addition, f-puts-gets suffers from synchronization overheads (caused by concurrent accesses to LG), while ML suffers from inter-process protocol-specific communication. The discussed overheads depend heavily on the application type. Our key-value store constitutes a worst-case scenario because it does not allow for long epochs that could enable, e.g., sending the logs of multiple gets in bulk. The performance penalties would be smaller in applications that overlap computation with communication and use non-blocking gets.
RELATED WORK
In this section we discuss existing checkpointing and logging schemes (see Figure 12). For excellent surveys, see [6, 17, 40]. Existing work on fault tolerance in RMA/PGAS is scarce; an example scheme that uses PGAS for data replication can be found in [5].
Checkpointing Protocols
These schemes are traditionally divided into uncoordinated, coordinated, and communication induced, depending on process coordination scale [17] . There are also complete and incremental protocols that differ in checkpoint sizes [40] .
Uncoordinated Schemes
Uncoordinated schemes do not synchronize while checkpointing, but may suffer from the domino effect or complex recoveries [17]. Example protocols are based on dependency [9] or checkpoint graphs [17]. A recent scheme targeting large-scale systems is Ken [41].

Coordinated Schemes
Here, processes synchronize to produce consistent global checkpoints. There is no domino effect and recovery is simple, but synchronization may incur severe overheads. Coordinated schemes can be blocking [17] or non-blocking [11]. There are also schemes based on loosely synchronized clocks [39] and minimal coordination [26].

Communication Induced Schemes
Here, senders add scheme-specific data to application messages that receivers use to, e.g., avoid taking useless checkpoints. These schemes can be index-based [21] or model-based [17, 32].

Incremental Checkpointing
An incremental checkpoint updates only the data that changed since the previous checkpoint. These protocols are divided into page-based [40] and hash-based [4]. They can reside at the level of an application, a library, an OS, or hardware [40]. Other schemes can be compiler-enhanced [10] or adaptive [4].

Others
Recently, multi-level checkpointing was introduced [8, 30, 38]. Adaptive checkpointing based on failure prediction is discussed in [28]. Diskless checkpointing is presented in [35]. Other interesting schemes are based on Reed-Solomon coding [8], cutoff and compression to reduce checkpoint sizes [23], checkpointing on clouds [33], reducing I/O bottlenecks [25], and performant checkpoints to PFS [7].
Logging Protocols
Logging enables restored processes to replay their execution beyond the most recent checkpoint. Log-based protocols are traditionally categorized into pessimistic, optimistic, and causal [17]; they can also be sender-based [20, 37] or receiver-based [17], depending on which side logs messages.

Figure 12: An overview of existing checkpointing and logging schemes (§ 8). A dashed rectangle illustrates a new sub-hierarchy introduced in the paper: dividing the logging protocols with respect to the communication model that they address.

Pessimistic Schemes
Such protocols log events before they influence the system. This ensures no orphan processes and simpler recovery, but may incur severe overheads during fault-free runs. An example protocol is V-MPICH [18].
Optimistic Schemes
Here, processes postpone logging messages to achieve, e.g., better computation-communication overlap. However, the algorithms for recovery are usually more complicated and crashed processes may become orphans [17]. A recent scheme can be found in [37].

Causal Schemes
In such schemes processes log and exchange (by piggybacking to messages) dependencies needed for recovery. This ensures no orphans but may reduce bandwidth [17]. An example protocol is discussed in [16].
Other Important Studies & Discussion
Deriving an optimum checkpointing interval is presented in [15] . Formalizations targeting resilience can be found in [17, 32] . Containment domains for encapsulating failures within a hierarchical scope are discussed in [13] . Modeling and prediction of failures is addressed in [8, 13] . Work on send determinism in MP can be found in [20] .
Our study goes beyond the existing research scope presented in this section. First, we develop a fault tolerance model that covers virtually all of the rich RMA semantics. Other existing formalizations (e.g., [6, 17, 32]) target MP only. We then use the model to formally analyze why resilience for RMA differs from MP and to design checkpointing, logging, and recovery protocols for RMA. We identify and propose solutions to several challenges in resilience for RMA that do not exist in MP, e.g., consistency problems caused by the relaxed RMA memory model (§ 3.1, § 3.2.2, § 3.2.3), access non-determinism (§ 4.2), issues due to one-sided RMA communication (§ 3.2.1), and logging multiple RMA-specific orders (§ 4.1). Our model enables proving the correctness of the proposed schemes; all proofs omitted due to space constraints can be found in the technical report. Extending our model to arbitrary hardware hierarchies generalizes the approach from [8] and enables formal reasoning about crashes of hardware elements and process distribution. Finally, our protocol leverages and combines several important concepts and mechanisms (Daly's interval [15], multi-level design [30], etc.) to improve the resilience of RMA systems even further and is the first implementation of holistic fault tolerance for RMA.
CONCLUSION
RMA programming models are growing in popularity and importance as they allow for the best utilization of hardware features such as OS-bypass or zero-copy data transfer. Still, little work addresses fault tolerance for RMA.
We established, described, and explored a complete formal model of fault tolerance for RMA and illustrated how to use it to design and reason about resilience protocols running on flat and hierarchical machines. It will play an important role in making emerging RMA programming fault tolerant and can be easily extended to cover, e.g., stable storage.
Our study does not resort to traditional, less scalable mechanisms that often rely on costly I/O flushes. The implementation of our holistic protocol adds negligible overheads to the application runtime, for example 1-5% for in-memory checkpointing and 8% for fully transparent logging of remote memory accesses in the NAS 3D FFT code. Our probability study shows that the protocol offers high resilience. The idea of demand checkpoints will help alleviate the problem of limited memory amounts in today's petascale and future exascale computing centers.
Finally, our work provides the basis for further reasoning about fault tolerance not only for RMA, but also for all the other models that can be constructed upon it, such as task-based programming models. This will play an important role in complex heterogeneous large-scale systems.
