Abstract. Data replication is a common technique for programming distributed systems, and is often important to achieve performance or reliability goals. Unfortunately, the replication of data can compromise its consistency, and thereby break programs that are unaware of it. In particular, in weakly consistent systems, programmers must assume some responsibility to properly deal with queries that return stale data, and to avoid state corruption under conflicting updates. The fundamental tension between performance (favoring weak consistency) and correctness (favoring strong consistency) is a recurring theme when designing concurrent and distributed systems, and is both practically relevant and of theoretical interest. In this course, we investigate how to understand and formalize consistency guarantees, and how we can determine if a system implementation is correct with respect to such specifications. We start by examining consensus, a classic problem in distributed systems, and then proceed to study various specifications and implementations of eventually consistent systems.

As more and more developers write programs that execute on a virtualized cloud infrastructure, they find themselves confronted with the subtleties that have long been the hallmark of distributed systems research. Devising message protocols, reading and writing weakly consistent shared data, and handling failures are notoriously challenging, and are gaining relevance for a new generation of developers. With this in mind, I devised this course to provide a mix of techniques and results that may prove interesting, useful, or both.

In the first half, I present well-known results and techniques from the area of distributed systems research, including:

- A beautiful, classic result: the impossibility of implementing consensus in the presence of silent crashes on an asynchronous system

In the second half, I focus on the main topic, which is consistency models for shared data.
This part includes:

- A formalization of strong consistency (sequential consistency, linearizability) and a proof of the CAP theorem

These lecture notes are not meant to serve as a transcript. Rather, their purpose is to complement the slides. Update: Since giving the original lectures at the LASER summer school, I have expanded and revised much of the material presented in Sects. 3 and 4. The result is now available as a short textbook.

1 Preliminaries

We introduce some basic mathematical notations for sets, sequences, and relations.

Sets. We assume standard notations for sets. Note that we write A ⊆ B to denote ∀a ∈ A : a ∈ B. In particular, the notation A ⊆ B neither implies nor rules out A = B. We let N be the set of all natural numbers (starting with number 1), and N0 = N ∪ {0}. The power set P(A) is the set of all subsets of A.

Sequences. Given a set A, we let A* be the set of finite sequences (or "words") of elements of A, including the empty sequence, which is denoted ε. We let A+ ⊆ A* be the set of nonempty sequences of elements of A. Thus, A* = A+ ∪ {ε}. For two sequences u, v ∈ A*, we write u · v to denote their concatenation (which is also in A*). If f : A → B is a function and w ∈ A* is a sequence, then we let f(w) ∈ B* be the sequence obtained by applying f to each element of w. Sometimes we write A^ω for the set of ω-infinite sequences of elements of A.

Multisets. A finite multiset m over some base set A is defined to be a function m : A → N0 such that m(a) = 0 for almost all a (= all but finitely many). The idea is that we represent the multiset as the function that defines how many times each element of A is in the set. We let M(A) denote the set of all finite multisets over A. When convenient, we interpret an element a ∈ A as the singleton multiset containing a.
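To make the function view of multisets concrete, here is a small Python sketch (the code and its variable names are mine, not part of the text). Python's `collections.Counter` implements exactly the representation m : A → N0, storing only the finitely many nonzero entries:

```python
from collections import Counter

# A finite multiset over a base set A is a function m : A -> N0 that is
# zero almost everywhere; Counter stores only the nonzero entries.
m1 = Counter({"x": 2, "y": 1})   # the multiset {x, x, y}
m2 = Counter({"y": 1, "z": 3})   # the multiset {y, z, z, z}

# Multiset sum (vector-style addition): multiplicities add up pointwise.
assert m1 + m2 == Counter({"x": 2, "y": 2, "z": 3})

# Multiset difference: multiplicities subtract, but never drop below zero.
assert m1 - m2 == Counter({"x": 2})

# Multiplicity lookup: m(a) is just function application, 0 if absent.
assert m1["x"] == 2 and m1["w"] == 0

# The singleton multiset containing a single element a.
a_singleton = Counter({"a": 1})
```

The sum of two multisets corresponds to pointwise addition of the underlying functions, which is why a mix of set-style and vector-style notation is natural.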
We use the following notations for typical operations on multisets (using a mix of symbols taken from set notations and vector notations).

Orders. A partial order r over a set A is a binary relation over A (we write a --r--> b to denote (a, b) ∈ r) that is irreflexive and transitive. Note that partial orders are acyclic (if there were a cycle, transitivity would imply a --r--> a for some a, contradicting irreflexivity). We often visualize partial orders as directed acyclic graphs. Moreover, in such drawings, we usually omit transitively implied edges, to avoid overloading the picture. A partial order does not necessarily order all elements. In fact, that is precisely what distinguishes it from a total order: a partial order r over A is a total order if for all a, b ∈ A such that a ≠ b, either a --r--> b or b --r--> a. All total orders are also partial orders.

Many authors define partial orders to be reflexive rather than irreflexive. We chose to define them as irreflexive, to keep them more similar to total orders, and to keep the definition more consistent with our favorite visualization, directed acyclic graphs, whose vertices never have self-loops. This choice is only superficial, not a deep distinction: consider the familiar notations < and ≤. Conceptually, they represent the same ordering relation, but one of them is irreflexive, the other one reflexive. In fact, if r is a total or partial order, we sometimes write a <_r b to represent a --r--> b.

A total order can be used to sort a set. For some finite set A' ⊆ A and a total order r over A, we let A'.sort(r) ∈ A* be the sequence obtained by sorting the elements of A' in ascending <_r order.

2 Models and Machines

To reason about protocols and consistency, we need terminology and notation that help us abstract from details. In particular, we need models for machines, and ways to characterize their behavior by stating and then proving or refuting their properties.

2.1 Labeled Transition Systems

Labeled transition systems provide a useful formalization and terminology that applies to a wide range of machines.

Definition 1. A labeled transition system (LTS) is a tuple (Cnf, Ini, Act, →), where Cnf is a set of configurations, Ini ⊆ Cnf is a set of initial configurations, Act is a set of actions, and → ⊆ Cnf × Act × Cnf is a transition relation. We write s --a--> s' to denote (s, a, s') ∈ →.
When using an LTS to model a system, a configuration represents a global snapshot of the state of every component of the system. Actions are abstractions that can model a number of activities, such as sending or receiving of messages, interacting with a user, doing some internal processing, or combinations thereof. Labeled transition systems are often visualized using labeled graphs, with vertices representing the configurations and labeled edges representing the actions.

We say an action a ∈ Act is enabled in a configuration s ∈ Cnf if there exists an s' ∈ Cnf such that s --a--> s'. More than one action can be enabled in a configuration, and in general, an action can lead to more than one successor configuration. We say an action a is deterministic if the latter is never the case, that is, if for all s ∈ Cnf, there is at most one s' ∈ Cnf such that s --a--> s'.

Defining an LTS to represent a concurrent system helps us to reason precisely about its executions and their correctness. An execution fragment E is a (finite or infinite) alternating sequence of configurations and actions

  c0 --a1--> c1 --a2--> c2 --a3--> ...

and an execution is an execution fragment that starts in an initial configuration. We formalize these definitions as follows.

Definition 2. Given some LTS (Cnf, Ini, Act, →), an execution fragment E consists of a length E.len ∈ N0 ∪ {∞}, configurations E.cnf(i) for 0 ≤ i ≤ E.len, and actions E.act(i) for 1 ≤ i ≤ E.len, such that E.cnf(i−1) --E.act(i)--> E.cnf(i) for all 1 ≤ i ≤ E.len.

We define pre(E) = E.cnf(0) and post(E) = E.cnf(E.len) (we write post(E) = ⊥ if E.len = ∞). Two execution fragments E1, E2 can be concatenated to form another execution fragment E1 · E2 if E1.len ≠ ∞ and post(E1) = pre(E2). We say a configuration c' ∈ Cnf is reachable from a configuration c ∈ Cnf if there exists an execution fragment E such that c = pre(E) and c' = post(E). We say a configuration c ∈ Cnf is reachable if it is reachable from an initial configuration.

Reasoning about executions usually involves reasoning about events. An event is an occurrence of an action (the same action can occur several times in an execution, each occurrence being a separate event). Technically, we define the events of an execution fragment E to be the set of numbers Evt(E) = {1, 2, . . . , E.len}.
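As an illustration, enabledness, determinism, and the transition relation can be prototyped for a finite LTS in a few lines of Python (a toy model of my own; the class and method names are illustrative, not from the text):

```python
# A toy finite LTS: configurations, initial configurations, actions, and a
# transition relation given as a set of (s, a, s') triples.
class LTS:
    def __init__(self, cnf, ini, act, steps):
        self.cnf, self.ini, self.act = cnf, ini, act
        self.steps = steps  # the transition relation

    def successors(self, s, a):
        """All s2 such that s --a--> s2."""
        return {s2 for (s1, a1, s2) in self.steps if s1 == s and a1 == a}

    def enabled(self, a, s):
        # a is enabled in s if at least one a-successor of s exists.
        return len(self.successors(s, a)) > 0

    def deterministic(self, a):
        # a is deterministic if no configuration has two distinct a-successors.
        return all(len(self.successors(s, a)) <= 1 for s in self.cnf)

# A two-configuration system: 'inc' moves 0 to 1, 'noise' loops everywhere.
lts = LTS(
    cnf={0, 1}, ini={0}, act={"inc", "noise"},
    steps={(0, "inc", 1), (0, "noise", 0), (1, "noise", 1)},
)
assert lts.enabled("inc", 0) and not lts.enabled("inc", 1)
assert lts.deterministic("inc") and lts.deterministic("noise")
```

An execution fragment of this system is then any alternating sequence of configurations and actions that follows `steps`, for example 0 --noise--> 0 --inc--> 1.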
Then, for events e, e' ∈ Evt(E), e < e' means e occurs before e' in the execution, and E.act(e) is the action of event e. Given an execution fragment E of an LTS L, we let trc(E) ∈ (L.Act* ∪ L.Act^ω) be the (finite or infinite) sequence of actions in E, called the trace of E. If all actions of L are deterministic, then E is completely determined by pre(E) and trc(E). For that reason, traces are sometimes called schedules. In our proofs, we often need to take an existing execution and modify it slightly by reordering certain actions.

Given a configuration c and a deterministic action a, we write post(c, a) for the uniquely determined c' satisfying c --a--> c', or ⊥ if no such c' exists (because a is not enabled in c). Similarly, we write post(c, w), for an action sequence w ∈ Act*, to denote the configuration reached from c by performing the actions in w, or ⊥ if that is not possible.

In the remainder of this text, all of our LTSs are constructed in such a way that all actions are deterministic. Working with deterministic actions can have practical advantages. For testing and debugging protocols, we often need to analyze or reproduce failures based on partial information about the execution, such as a trace log. If the log contains the sequence of actions in the order they happened, and if the actions are deterministic, then the log contains sufficient information to fully reproduce the execution.

2.2 Asynchronous Message Protocols

An LTS can express many different kinds of concurrent systems, but we care mostly about message passing protocols in this context. Therefore, we specialize the general LTS definition above to define such systems. Throughout this text, we assume that Pid is a set of process identifiers (possibly infinite, to model dynamic creation). Furthermore, we assume that there is a total order defined on the process identifiers Pid. For example, Pid = N.
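The replay idea behind post(c, w) can be sketched as follows (my own illustrative code, assuming a deterministic `step` function that returns the unique successor configuration, or `None` when the action is not enabled):

```python
def post(cnf, trace, step):
    """Replay post(c, w): apply the actions of trace w in order, starting
    from configuration cnf, using the deterministic step function.
    Returns the final configuration, or None if some action is not enabled."""
    for a in trace:
        cnf = step(cnf, a)
        if cnf is None:       # the action was not enabled: replay fails
            return None
    return cnf

# Toy example: configurations are integers, actions are "inc" and "double".
def step(c, a):
    if a == "inc":
        return c + 1
    if a == "double":
        return 2 * c
    return None               # unknown action: never enabled

# Reproducing an execution from its trace log:
assert post(0, ["inc", "double", "inc"], step) == 3
assert post(0, ["dec"], step) is None
```

Because each action has exactly one successor, the trace log alone determines every intermediate configuration, which is what makes deterministic actions so convenient for debugging.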
Definition 3. A protocol definition is a tuple (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd) where

- Pst is a set of process states, with a function ini : Pid → Pst giving the initial state of each process;
- Msg is a set of messages, with functions ori, dst : Msg → Pid giving the origin and destination of each message;
- Act is a set of actions, with functions pid : Act → Pid (the process performing the action), cnd (the condition on the local process state under which the action is enabled), rcv : Act → Msg ∪ {⊥} (the message received by the action, if any), snd (the multiset of messages sent by the action), and upd (the update of the local process state);
- only finitely many actions apply at a time: in each configuration, only finitely many actions are enabled.

We call actions a that receive no message (i.e. rcv(a) = ⊥) spontaneous. A protocol definition Φ induces an LTS L_Φ (Definition 4): each configuration is a pair (P, M) with P being a function that maps each process identifier to the current state of that process, and M being a multiset that represents messages that are currently "in flight". For a configuration c, we write c.P and c.M to denote its components. When reasoning about an execution E of L_Φ, we use similar notational shortcuts for its components.

Example. Consider a simple protocol where the processes try to reach consensus on a single bit. We assume that the initial state of each process contains the bit value it is going to propose. We can implement a simple leader-based protocol to reach consensus by fixing some leader process l ∈ Pid. The idea is based on a "race to the leader", which works in three stages: (1) each process sends a message containing the bit value it is proposing to the leader, (2) the leader, upon receiving any such message, announces the received value to all other processes, and (3) upon receiving the announcement, each recipient decides on that value. We show how to write pseudocode for this protocol.

- The message section lists the messages: each message has a name and several named, typed parameters. We show how the functions ori and dst (which determine the origin and destination of each message) are defined in the comment at the end of each line.
- The remaining sections define the actions, with one section per action. The entries have the following meaning:
  • The first line of each action section defines the action label, which is a name together with named, typed parameters. All action labels together constitute the set Act. The comment at the end of the line defines the pid function, which determines the process to which this action belongs.
  • The receives section defines the rcv function. If there is a receives line present, it defines the message that is received by this action, and if there is no receives line, it specifies that this action is spontaneous.
  • The sends section defines the snd function. It specifies the message, or the multiset of messages, to be sent by this action. We use the multiset notations as described in Sect. 1; in particular, the sum symbol is used to describe a collection of messages. We omit this section if no messages are sent.
  • The condition section defines the cnd function, representing a condition that is necessary for this action to be performed. It describes a predicate over the local process state (i.e. over the variables defined in the process state section). We omit this section if the action is unconditional.
  • The updates section defines the upd function, by specifying how to update the local process state. We omit this section if the process state is not changed.

One could conceivably formalize these definitions and produce a practically usable programming language for protocols; in fact, this has already been done for the programming language used by the Murφ tool.

2.3 Consensus Protocols

Consider the consensus protocol shown earlier. What makes a protocol a consensus protocol? Somehow, we start out with a bit on each participant describing its preference. When the protocol is done, everyone should agree on some bit value that was one of the proposed values. And there should be progress eventually, i.e. the protocol should terminate with a decision.

We now formalize what we mean by a consensus protocol, by adding functions pref and dec that formalize the notions of initial preference and of decisions (Definition 5): a consensus protocol is a tuple (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd, pref, dec) such that

- (Pst, . . . , upd) is a protocol,
- pref describes the initial state of a process for a given preference bit, and
- dec extracts the current decision, if any, from a process state.
For example, for the strawman protocol, we define pref(p, b).preference = b and pref(p, b).decision = ⊥, and we define dec(s) = s.decision.

Next, we formalize the correctness conditions we briefly outlined at the beginning of this section, and then examine if they hold for our strawman. For an execution E, we define the following properties:

1. Stability. If a value is decided at a process p, it remains decided forever.
2. Agreement. No two processes decide differently.
3. Validity. If a value is decided, this value must match the preference of at least one of the processes.
4. Termination. Eventually, a decision is reached on all correct processes. (We talk more about failures later. For now, just assume that the set F of faulty processes is empty.)

Does our strawman protocol satisfy all of these properties, for all of its executions? Certainly, this is true for the first three.

1. Strawman satisfies agreement and stability. There can be at most one announce event, because only the leader can perform the announce action, and the leader sets the decided variable to true after doing the announce, which prevents further announce actions. Therefore, all decide actions must receive an Announcement message sent by the same announce event; thus all the actions that write a decision value write the same value. Decision values are stable: there is no action that writes ⊥ to the decision variable.
2. Strawman satisfies validity. Any announce event (for some bit b) receives a Proposal message that must have originated in some propose event (with the same bit b), which has as a precondition that the variable proposal = b. Thus, b matches the preference of that process.

Termination, however, is not satisfied for all executions. For example, in an execution of length 0, no decision is reached. Perhaps it would be more reasonable to restrict our attention to complete executions, i.e. executions that cannot be extended by performing more actions (they are infinite, or end in a configuration where no action is enabled). But even so, termination can fail: consider an infinite execution in which only propose actions are performed. Clearly, no progress is made and an unbounded number of messages is sent.
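To make the strawman concrete, here is an executable Python sketch of one complete run (the message names `Proposal` and `Announcement` follow the text; the data representation and helper names are my own):

```python
from collections import Counter

LEADER = 0
PIDS = [0, 1, 2]

def initial(prefs):
    # Process state: the proposed bit, the decision (None = undecided),
    # and, for the leader, whether it has already announced.
    P = {p: {"proposal": prefs[p], "decision": None, "decided": False} for p in PIDS}
    M = Counter()  # multiset of messages currently "in flight"
    return P, M

def propose(P, M, p):
    # Stage 1: process p sends its proposal bit to the leader.
    M[("Proposal", p, LEADER, P[p]["proposal"])] += 1

def announce(P, M, msg):
    # Stage 2: the leader receives some Proposal and announces its bit to all.
    kind, src, dst, bit = msg
    assert kind == "Proposal" and M[msg] > 0 and not P[LEADER]["decided"]
    M[msg] -= 1
    for q in PIDS:
        M[("Announcement", LEADER, q, bit)] += 1
    P[LEADER]["decided"] = True  # the condition blocking further announcements

def decide(P, M, msg):
    # Stage 3: a recipient of an Announcement decides on the announced bit.
    kind, src, dst, bit = msg
    assert kind == "Announcement" and M[msg] > 0
    M[msg] -= 1
    P[dst]["decision"] = bit

# One run with preferences 1, 0, 1; process 1's proposal wins the race.
P, M = initial({0: 1, 1: 0, 2: 1})
for p in PIDS:
    propose(P, M, p)
announce(P, M, ("Proposal", 1, LEADER, 0))
for q in PIDS:
    decide(P, M, ("Announcement", LEADER, q, 0))

# Agreement and validity hold: everyone decided the same, proposed, bit.
assert all(P[p]["decision"] == 0 for p in PIDS)
```

Letting a different Proposal win the race changes the decided bit, but never breaks agreement or validity, matching the argument above.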
In such an execution, no decision is ever reached. Still, it appears that this criticism is not fair! It is hard to imagine how any protocol can achieve termination unless the transport layer and the process scheduler cooperate. Clearly, if the system simply does not deliver messages, or never executes actions even though they are enabled, nothing good can happen. We need fairness: some assumptions about the "minimal level of service" we may expect. Informally, what we want to require is that messages are eventually delivered unless they become undeliverable, and that spontaneous actions are eventually performed unless they become disabled. We say an action a ∈ Act receives message m ∈ Msg if rcv(a) = m. We say m ∈ Msg is receivable in a configuration s if there exists an action a that is enabled in s and that receives m.

Definition 7. A message m is neglected by an execution E if it is receivable in infinitely many configurations, but received by only finitely many actions. A spontaneous action a is neglected by an execution E if it is enabled in infinitely many configurations, but performed only finitely many times.

Definition 8. An execution E of some protocol Φ is fair if it does not neglect any messages or spontaneous actions.

Definition 9. A consensus protocol is a correct consensus protocol if all fair complete executions satisfy stability, agreement, validity, and termination.

Strawman is Correct. We already discussed agreement and stability, and validity. Termination is also satisfied for fair executions, for the following reasons. Because the propose action is always enabled for all p, it must happen at least once (in fact, it will happen infinitely many times for all p). After it happens just once, announce is enabled, and remains enabled forever if it does not happen. Thus announce must happen (otherwise fairness is violated). But now, for each q, decide is enabled, and thus must happen eventually.

Fair Schedulers.
The definition of fairness is purposefully quite general; it does not describe how exactly a scheduler guarantees fairness. However, it is useful to consider how to construct a scheduler that guarantees fairness. One way to do so is to always schedule an action that has maximal seniority, in the sense that it is a spontaneous action or receives a message that has been waiting (i.e. been enabled/receivable but not executed/received) the longest. We claim that any execution produced by such a scheduler is fair.

Proof. Assume to the contrary that there exists an execution that is not fair, that is, one that neglects a message or spontaneous action. First, consider the case that a message m is neglected. This means that the message is receivable infinitely often, but received only finitely many times. Consider the first configuration where it is receivable after the last time it is received, say E.cnf(k). Since m is receivable in infinitely many configurations {E.cnf(k') | k' > k} but never received, there must be infinitely many configurations {E.cnf(k') | k' > k} where some enabled action is more senior than the one that receives m (otherwise the scheduler would pick that one). However, an action can only be more senior than the one that receives m if it is either receiving some message that has been waiting (i.e. has been receivable without being received) at least as long as m, or a spontaneous action that has been waiting (i.e. has been enabled without being performed) at least as long as m. But there can be only finitely many such messages or spontaneous actions, since there are only finitely many configurations {E.cnf(j) | j ≤ k}, and each such configuration has only finitely many receivable messages and enabled spontaneous actions, by the last condition in Definition 3; thus we have a contradiction. Now, consider the case that a spontaneous action is neglected. We get a contradiction by the same reasoning.

Independence. The notion of independence of actions and schedules is also often useful.
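The maximal-seniority rule used in the proof above can be sketched as a simple scheduler loop (an illustrative toy model of my own; pending work items are tagged with the step at which they first became enabled, and ties are broken deterministically by name):

```python
def fair_run(initial_pending, new_work, steps):
    """Run a maximal-seniority scheduler for a number of steps.

    initial_pending: dict mapping work item -> step at which it became enabled.
    new_work(step): items that newly become enabled at this step.
    Returns the order in which items were executed."""
    pending = dict(initial_pending)
    executed = []
    for step in range(steps):
        for item in new_work(step):
            pending.setdefault(item, step)   # record when it first became enabled
        if pending:
            # Pick an item with maximal seniority: enabled the longest,
            # breaking ties deterministically by name.
            oldest = min(pending, key=lambda it: (pending[it], it))
            del pending[oldest]
            executed.append(oldest)
    return executed

# Two messages waiting from the start, plus one new message arriving per step:
run = fair_run({"m0": 0, "m1": 0}, lambda s: [f"n{s}"], steps=4)
# Seniority guarantees the old messages are not neglected in favor of new ones.
assert run[:2] == ["m0", "m1"]
assert "n0" in run
```

Because an item's seniority only grows while it waits, every pending item is eventually the most senior one, which mirrors the counting argument in the proof.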
We can define independence for general labeled transition systems as follows: two actions a, a' are independent if, in every configuration where both are enabled, neither disables the other, and performing them in either order leads to the same configuration. For protocols, actions performed by different processes are independent. This is because executing an action of process p can only remove messages destined for p from the message pool, and can thus not disable any actions on any other process. Moreover, actions by different processes always commute, because their effects on the local states target the states of different processes, and their effects on the message pool commute.

We call two schedules s, s' ∈ Act* independent if for all a ∈ s and a' ∈ s', the actions a and a' are independent. Note that if two schedules s, s' are independent and both possible in some configuration c, then post(c, s · s') = post(c, s' · s). Visually, this can be seen by a typical tiling argument.

2.4 Failures

As we probably all know from experience, failures are common in distributed systems. Failures can originate in the transport layer (a logical abstraction of the network, including switches, links, proxies, etc.) or in the nodes (computers running the protocol software). Sometimes, the distinction is not that clear (for example, messages that are waiting in buffers are conceptually in the transport layer, but are subject to loss if the node fails). We now show how, given a protocol Φ and its LTS as defined in Sect. 2.2 (Definition 3), we can model failures by adding failure actions to the LTS L_Φ defined in Definition 4.

Modeling Transport Failures. Failures for message delivery often include (1) reordering, (2) loss, (3) duplication, and (4) injection of messages. In our protocol model, reorderings are already allowed, thus we do not consider them to be a failure.
To model message loss, we can add an action that removes an arbitrary message from the message pool M. Similarly, we can add an action for message duplication, which adds a second copy of a message already in M. We can also model injection, with an action that adds an arbitrary message to M. However, we will not talk more about the latter, which is considered a Byzantine failure, and which opens up a whole new category of challenges and results.

Masking Transport Failures. Protocols can mask message reordering, loss, and duplication by affixing sequence numbers to messages, and by using send and receive buffers. Receivers can detect missing messages in the sequence and rerequest them. In fact, socket protocols (such as TCP) use this type of mechanism (e.g. sliding windows).
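As an illustration of the masking mechanism, here is a toy Python model of sequence numbering with a receive buffer (my own sketch, not TCP; rerequesting lost packets is omitted, so only reordering and duplication are masked here):

```python
class OrderedChannel:
    """Masks reordering and duplication by tagging each message with a
    sequence number and buffering out-of-order arrivals. A toy model of
    the mechanism described in the text, not a real transport protocol."""

    def __init__(self):
        self.next_seq = 0      # sender side: next sequence number to assign
        self.expected = 0      # receiver side: next sequence number to deliver
        self.recv_buf = {}     # receiver side: out-of-order arrivals, seq -> payload

    def send(self, payload):
        packet = (self.next_seq, payload)
        self.next_seq += 1
        return packet          # handed to the (unreliable) transport

    def receive(self, packet):
        """Accept a packet from the transport; return the payloads that can
        now be delivered to the application, in sequence order."""
        seq, payload = packet
        if seq >= self.expected:          # ignore duplicates of old packets
            self.recv_buf[seq] = payload  # overwriting a duplicate is harmless
        delivered = []
        while self.expected in self.recv_buf:
            delivered.append(self.recv_buf.pop(self.expected))
            self.expected += 1            # a gap here would trigger a rerequest
        return delivered

ch = OrderedChannel()
p0, p1, p2 = ch.send("a"), ch.send("b"), ch.send("c")
assert ch.receive(p1) == []               # arrives early: buffered, gap detected
assert ch.receive(p1) == []               # duplicate: ignored
assert ch.receive(p0) == ["a", "b"]       # gap filled: both delivered in order
assert ch.receive(p2) == ["c"]
```

The gap at `self.expected` is exactly the point where a real protocol would rerequest the missing message, turning this into a mask for loss as well.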