Consistency Models with Global Operation Sequencing and their Composition
Modern distributed systems often achieve availability and scalability by providing consistency guarantees for the data they manage that are weaker than linearizability. We consider a class of such consistency models that, despite this weakening, guarantee that clients eventually agree on a global sequence of operations, while seeing a subsequence of this final sequence at any given point in time. Examples of such models include the classical Total Store Order (TSO) and the recently proposed dual TSO, Global Sequence Protocol (GSP) and Ordered Sequential Consistency.
We define a unified model, called Global Sequence Consistency (GSC), that has the above models as its special cases, and investigate its key properties. First, we propose a condition under which multiple objects each satisfying GSC can be composed so that the whole set of objects satisfies GSC. Second, we prove an interesting relationship between special cases of GSC (GSP, TSO and dual TSO): we show that clients that do not communicate out-of-band cannot tell the difference between these models. To obtain these results, we propose a novel axiomatic specification of GSC and prove its equivalence to the operational definition of the model.
Global Sequence Protocol: A Robust Abstraction for Replicated Shared State
In the age of cloud-connected mobile devices, users want responsive apps that read and write shared data everywhere, at all times, even if network connections are slow or unavailable. The solution is to replicate data and propagate updates asynchronously. Unfortunately, such mechanisms are notoriously difficult to understand, explain, and implement.
To address these challenges, we present GSP (global sequence protocol), an operational model for replicated shared data. GSP is simple and abstract enough to serve as a mental reference model, and offers fine control over asynchronous update propagation (update transactions, strong synchronization). It abstracts the data model and thus applies both to simple key-value stores and to complex structured data. We then show how to implement GSP robustly on a client-server architecture (masking silent client crashes, server crash-recovery failures, and arbitrary network failures) and efficiently (transmitting and storing minimal information by reducing update sequences).
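The GSP abstract above can be made concrete with a small sketch. This is not the paper's robust implementation; the class and method names (`GspServer`, `GspClient`, `sync`) are illustrative, and the "server" stands in for whatever component orders updates into the single global sequence:

```python
class GspServer:
    """Orders updates from all clients into one global sequence."""
    def __init__(self):
        self.sequence = []          # the authoritative global sequence

    def append(self, update):
        self.sequence.append(update)

    def prefix_since(self, n):
        return self.sequence[n:]


class GspClient:
    """Sees a prefix of the global sequence plus its own pending updates."""
    def __init__(self, server):
        self.server = server
        self.known = 0              # length of the confirmed prefix
        self.pending = []           # updates sent but not yet confirmed
        self.state = []             # local view: confirmed prefix

    def update(self, u):
        self.pending.append(u)
        self.server.append(u)       # in reality this send is asynchronous

    def sync(self):
        new = self.server.prefix_since(self.known)
        self.state.extend(new)
        self.known += len(new)
        # drop pending updates that are now part of the confirmed prefix
        self.pending = [u for u in self.pending if u not in new]

    def read(self):
        # reads see the confirmed prefix followed by pending updates
        return self.state + self.pending
```

After every client syncs, all clients observe the same global sequence, which is the eventual-agreement guarantee the abstract describes.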
Consistency in Distributed Systems
Abstract. Data replication is a common technique for programming distributed systems, and is often important to achieve performance or reliability goals. Unfortunately, the replication of data can compromise its consistency, and thereby break programs that are unaware. In particular, in weakly consistent systems, programmers must assume some responsibility to properly deal with queries that return stale data, and to avoid state corruption under conflicting updates. The fundamental tension between performance (favoring weak consistency) and correctness (favoring strong consistency) is a recurring theme when designing concurrent and distributed systems, and is both practically relevant and of theoretical interest. In this course, we investigate how to understand and formalize consistency guarantees, and how we can determine if a system implementation is correct with respect to such specifications. We start by examining consensus, a classic problem in distributed systems, and then proceed to study various specifications and implementations of eventually consistent systems. As more and more developers write programs that execute on a virtualized cloud infrastructure, they find themselves confronted with the subtleties that have long been the hallmark of distributed systems research. Devising message protocols, reading and writing weakly consistent shared data, and handling failures are notoriously challenging, and are gaining relevance for a new generation of developers. With this in mind, I devised this course to provide a mix of techniques and results that may prove either interesting, or useful, or both. In the first half, I am presenting well-known results and techniques from the area of distributed systems research, including:
- A beautiful, classic result: the impossibility of implementing consensus in the presence of silent crashes on an asynchronous system.
In the second half, I focus on the main topic: consistency models for shared data.
This part includes:
- A formalization of strong consistency (sequential consistency, linearizability) and a proof of the CAP theorem.
These lecture notes are not meant to serve as a transcript. Rather, their purpose is to complement the slides. Update: Since giving the original lectures at the LASER summer school, I have expanded and revised much of the material presented in Sects. 3 and 4. The result is now available as a short textbook.

Preliminaries. We introduce some basic mathematical notations for sets, sequences, and relations. We assume standard set notations. Note that we write A ⊆ B to denote ∀a ∈ A : a ∈ B. In particular, the notation A ⊆ B neither implies nor rules out A = B. We let N be the set of all natural numbers (starting with number 1), and N0 = N ∪ {0}. The power set P(A) is the set of all subsets of A.

Sequences. Given a set A, we let A* be the set of finite sequences (or "words") of elements of A, including the empty sequence, which is denoted ε. We let A+ ⊆ A* be the set of nonempty sequences of elements of A. Thus, A* = A+ ∪ {ε}. For two sequences u, v ∈ A*, we write u · v to denote their concatenation (which is also in A*). If f : A → B is a function, and w ∈ A* is a sequence, then we let f(w) ∈ B* be the sequence obtained by applying f to each element of w. Sometimes we write A^ω for the set of ω-infinite sequences of elements of A.

Multisets. A finite multiset m over some base set A is defined to be a function m : A → N0 such that m(a) = 0 for almost all a (= all but finitely many). The idea is that we represent the multiset as the function that defines how many times each element of A is in the set. We let M(A) denote the set of all finite multisets over A. When convenient, we interpret an element a ∈ A as the singleton multiset containing a.
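The multiset-as-function view above maps directly onto Python's `collections.Counter`, which also represents a multiset as a mapping from elements to counts with finite support; a quick illustration:

```python
from collections import Counter

# A finite multiset m over A is a function m : A -> N0 with finite support.
m = Counter({"a": 2, "b": 1})       # the multiset {a, a, b}

# Multiset union adds multiplicities (the "vector addition" view).
n = Counter({"a": 1, "c": 3})
assert m + n == Counter({"a": 3, "b": 1, "c": 3})

# m(x) = 0 for all elements not in the multiset ("almost all" of A).
assert m["z"] == 0
```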
We use the following notations for typical operations on multisets (using a mix of symbols taken from set notations and vector notations). Note that partial orders are acyclic (if there were a cycle, transitivity would imply a → a for some a, contradicting irreflexivity). We often visualize partial orders as directed acyclic graphs. Moreover, in such drawings, we usually omit transitively implied edges, to avoid overloading the picture. A partial order does not necessarily order all elements. In fact, that is precisely what distinguishes it from a total order: a partial order r over A is a total order if for all a, b ∈ A such that a ≠ b, either a →r b or b →r a. All total orders are also partial orders. Many authors define partial orders to be reflexive rather than irreflexive. We chose to define them as irreflexive, to keep them more similar to total orders, and to keep the definition more consistent with our favorite visualization, directed acyclic graphs, whose vertices never have self-loops. This choice is only superficial and not a deep distinction: consider the familiar notations < and ≤. Conceptually, they represent the same ordering relation, but one of them is reflexive, the other one is irreflexive. In fact, if r is a total or partial order, we sometimes write a <r b to represent a →r b. A total order can be used to sort a set. For some finite set A' ⊆ A and a total order r over A', we let A'.sort(r) ∈ A* be the sequence obtained by sorting the elements of A' in ascending <r order.

Models and Machines. To reason about protocols and consistency, we need terminology and notation that helps us abstract from details. In particular, we need models for machines, and ways to characterize their behavior by stating and then proving or refuting their properties.

Labeled Transition Systems. Labeled transition systems provide a useful formalization and terminology that applies to a wide range of machines.
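The A'.sort(r) operation defined above can be sketched by representing the order relation r as a set of pairs and handing a comparison to the standard sort; the names here are illustrative:

```python
from functools import cmp_to_key

# A total order r over A, given as the set of pairs (a, b) with a ->r b.
A = {"x", "y", "z"}
r = {("x", "y"), ("y", "z"), ("x", "z")}     # x <r y <r z

def sort_by(subset, order):
    """A'.sort(r): sort a finite subset ascending in <r order."""
    cmp = lambda a, b: -1 if (a, b) in order else 1
    return sorted(subset, key=cmp_to_key(cmp))

assert sort_by(A, r) == ["x", "y", "z"]
```

Because r is a total order, the comparison is defined for every pair of distinct elements, so the sort is well defined.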
When using an LTS to model a system, a configuration represents a global snapshot of the state of every component of the system. Actions are abstractions that can model a number of activities, such as sending or receiving of messages, interacting with a user, doing some internal processing, or combinations thereof. Labeled transition systems are often visualized using labeled graphs, with vertices representing the states and labeled edges representing the actions. We say an action a ∈ Act is enabled in state s ∈ Cnf if there exists an s' ∈ Cnf such that s →a s'. More than one action can be enabled in a state, and in general, an action can lead to more than one successor state. We say an action a is deterministic if that is never the case, that is, if for all s ∈ Cnf, there is at most one s' ∈ Cnf such that s →a s'. Defining an LTS to represent a concurrent system helps us to reason precisely about its executions and their correctness. An execution fragment E is a (finite or infinite) alternating sequence of states and actions, and an execution is an execution fragment that starts in an initial state. We formalize these definitions as follows. Definition 2. Given some LTS, we define pre(E) = E.cnf(0) and post(E) = E.cnf(E.len) (we write post(E) = ⊥ if E.len = ∞). Two execution fragments E1, E2 can be concatenated to form another execution fragment E1 · E2 if E1.len ≠ ∞ and post(E1) = pre(E2). We say a configuration c' ∈ Cnf is reachable from a configuration c ∈ Cnf if there exists an execution fragment E such that c = pre(E) and c' = post(E). We say a configuration c ∈ Cnf is reachable if it is reachable from an initial configuration. Reasoning about executions usually involves reasoning about events. An event is an occurrence of an action (the same action can occur several times in an execution, each being a separate event). Technically, we define the events of an execution fragment E to be the set of numbers Evt(E) = {1, 2, . . . , E.len}.
Then, for events e, e' ∈ Evt(E), e < e' means e occurs before e' in the execution, and E.act(e) is the action of event e. Given an execution fragment E of an LTS L, we let trc(E) ∈ (L.Act* ∪ L.Act^ω) be the (finite or infinite) sequence of actions in E, called the trace of E. If all actions of L are deterministic, then E is completely determined by E.pre and E.trc. For that reason, traces are sometimes called schedules. In our proofs, we often need to take an existing execution and modify it slightly by reordering certain actions. Given a configuration c and a deterministic action a, we write post(c, a) for the uniquely determined c' satisfying c →a c', or ⊥ if that is not possible (because a is not enabled in c). Similarly, we write post(c, w), for an action sequence w ∈ Act*, to denote the state reached from c by performing the actions in w, or ⊥ if not possible. In the remainder of this text, all of our LTSs are constructed in such a way that all actions are deterministic. Working with deterministic actions can have practical advantages. For testing and debugging protocols, we often need to analyze or reproduce failures based on partial information about the execution, such as a trace log. If the log contains the sequence of actions in the order they happened, and if the actions are deterministic, then the log contains sufficient information to fully reproduce the execution.

Asynchronous Message Protocols. An LTS can express many different kinds of concurrent systems, but we care mostly about message passing protocols in this context. Therefore, we specialize the general LTS definition above to define such systems. Throughout this text, we assume that Pid is a set of process identifiers (possibly infinite, to model dynamic creation). Furthermore, we assume that there is a total order defined on the process identifiers Pid. For example, Pid = N.
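The deterministic-replay point made above, that post(c, w) reproduces an execution from a trace log, can be sketched with a toy transition table (the states and actions here are illustrative):

```python
# A toy deterministic LTS: {(configuration, action): successor}.
trans = {
    ("idle", "start"): "busy",
    ("busy", "finish"): "idle",
}

def enabled(c, a):
    """a is enabled in c if a transition c -a-> c' exists."""
    return (c, a) in trans

def post(c, a):
    """post(c, a): the unique successor configuration, or None (bottom)."""
    return trans.get((c, a))

def post_seq(c, w):
    """post(c, w): replay an action sequence w (e.g. from a trace log).
    With deterministic actions, the log reproduces the execution exactly."""
    for a in w:
        if not enabled(c, a):
            return None        # w is not possible from c
        c = post(c, a)
    return c

assert post("idle", "start") == "busy"
assert post_seq("idle", ["start", "finish", "start"]) == "busy"
assert post_seq("idle", ["finish"]) is None
```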
Definition 3. A protocol definition is a tuple (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd) where:
- Pst is a set of process states, with a function ini;
- only finitely many actions apply at a time.
We call actions a that receive no message (i.e. rcv(a) = ⊥) spontaneous. The meaning is that each configuration is a pair (P, M) with P being a function that maps each process identifier to the current state of that process, and M being a multiset that represents messages that are currently "in flight". For a configuration c, we write c.P and c.M to denote its components. When reasoning about an execution E of L_Φ, we define the following notational shortcut.

Example. Consider a simple protocol where the processes try to reach consensus on a single bit. We assume that the initial state of each process contains the bit value it is going to propose. We can implement a simple leader-based protocol to reach consensus by fixing some leader process l ∈ Pid. The idea is based on a "race to the leader", which works in three stages: (1) each process sends a message containing the bit value it is proposing to the leader, (2) the leader, upon receiving any message, announces this value to all other processes, and (3) upon receiving the announced message, each recipient decides on that value. We show how to write pseudocode for this protocol. Each message has a name and several named typed parameters. We show how the functions ori and dst (which determine the origin and destination of each message) are defined in the comment at the end of each line.
- The remaining sections define the actions, with one section per action. The entries have the following meaning:
• The first line of each action section defines the action label, which is a name together with named typed parameters. All action labels together constitute the set Act. The comment at the end of the line defines the pid function, which determines the process to which this action belongs.
• The receives section defines the rcv function. If there is a receives line present, it defines the message that is received by this action; if there is no receives line, the action is spontaneous.
• The sends section defines the snd function. It specifies the message, or the multiset of messages, to be sent by this action. We use the multiset notations described in Sect. 1; in particular, the sum symbol is used to describe a collection of messages. We omit this section if no messages are sent.
• The condition section defines the cnd function, representing a condition that is necessary for this action to be performed. It describes a predicate over the local process state (i.e. over the variables defined in the process state section). We omit this section if the action is unconditional.
• The updates section defines the upd function, by specifying how to update the local process state. We omit this section if the process state is not changed.
One could conceivably formalize these definitions and produce a practically usable programming language for protocols; in fact, this has already been done for the programming language used by the Murφ tool. Consider the consensus protocol described above.

Consensus Protocols. What makes a protocol a consensus protocol? Somehow, we start out with a bit on each participant describing its preference. When the protocol is done, everyone should agree on some bit value that was one of the proposed values. And there should be progress eventually, i.e. the protocol should terminate with a decision. We now formalize what we mean by a consensus protocol, by adding functions to formalize the notions of initial preference and of decisions.

Definition 5. A consensus protocol is a tuple (Pst, Msg, Act, ini, ori, dst, pid, cnd, rcv, snd, upd, pref, dec) such that (Pst, . . . , upd) is a protocol. For example, for the strawman protocol, we define pref(p, b).preference = b and pref(p, b).decision = ⊥, and we define dec(s) = s.decision.

Next, we formalize the correctness conditions we briefly outlined at the beginning of this section, and then examine if they hold for our strawman. For an execution E, we define the following properties:
1. Stability. If a value is decided at a process p, it remains decided forever.
2. Agreement. No two processes should decide differently.
3. Validity. If a value is decided, this value must match the preference of at least one of the processes.
4. Termination. Eventually, a decision is reached on all correct processes. (We talk more about failures later; for now, just assume that the set F of faulty processes is empty.)
Does our strawman protocol satisfy all of these properties, for all of its executions? Certainly, this is true for the first three.
1. Strawman satisfies agreement and stability. There can be at most one announce event, because only the leader can perform the announce action, and the leader sets the decided variable to true after doing the announce, which prevents further announce actions. Therefore, all decide actions must receive an Announcement message sent by the same announce event; thus all the actions that write a decision value write the same value. Decision values are stable: there is no action that writes ⊥ to the decision variable.
2. Strawman satisfies validity. Any announce event (for some bit b) receives a Proposal message that must have originated in some propose event (with the same bit b), which has as a precondition that the variable proposal = b. Thus, b matches the preference of that process.
Termination, however, is not satisfied for all executions. For example, in an execution of length 0, no decision is reached. Perhaps it would be more reasonable to restrict our attention to complete executions. But consider a complete execution in which only propose actions are ever performed: clearly, no progress is made and an unbounded number of messages is sent.
No decision is reached. Still, it appears that this criticism is not fair! It is hard to imagine how any protocol can achieve termination unless the transport layer and the process scheduler cooperate. Clearly, if the system simply does not deliver messages, or never executes actions even though they are enabled, nothing good can happen. We need fairness: some assumptions about the "minimal level of service" we may expect. Informally, what we want to require is that messages are eventually delivered unless they become undeliverable, and that spontaneous actions are eventually performed unless they become disabled. We say an action a ∈ Act receives message m ∈ Msg if rcv(a) = m. We say m ∈ Msg is receivable in a configuration s if there exists an action a that is enabled and that receives m.

Definition 7. A message m is neglected by an execution E if it is receivable in infinitely many configurations, but received by only finitely many actions. A spontaneous action a is neglected by an execution E if it is enabled in infinitely many configurations, but performed only finitely many times.

Definition 8. An execution E of some protocol Φ is fair if it does not neglect any messages or spontaneous actions.

Definition 9. A consensus protocol is a correct consensus protocol if all fair complete executions satisfy stability, agreement, validity, and termination.

Strawman is Correct. We already discussed agreement and validity. Termination is also satisfied for fair executions, for the following reasons. Because the propose action is always enabled for all p, it must happen at least once (in fact, it will happen infinitely many times for all p). After it happens just once, announce is now enabled, and remains enabled forever if announce does not happen. Thus announce must happen (otherwise fairness is violated). But now, for each q, decide is enabled, and thus must happen eventually.

Fair Schedulers.
The definition of fairness is purposefully quite general; it does not describe how exactly a scheduler guarantees fairness. However, it is useful to consider how to construct a scheduler that guarantees fairness. One way to do so is to schedule an action that has maximal seniority, in the sense that it is executing a spontaneous action or receiving a message that has been waiting (i.e. been enabled/receivable but not executed/received) the longest.

Proof. Assume to the contrary that there exists an execution that is not fair, that is, neglects a message or spontaneous action. First, consider that a message m is neglected. This means that the message is receivable infinitely often, but received only finitely many times. Consider the first configuration where it is receivable after the last time it is received, say E.cnf(k). Since m is receivable in infinitely many configurations {E.cnf(k') | k' > k} but never received, there must be infinitely many configurations {E.cnf(k') | k' > k} where some enabled action is more senior than the one that receives m (otherwise the scheduler would pick that one). However, an action can only be more senior than the one that receives m if it is either receiving some message that has been waiting (i.e. has been receivable without being received) at least as long as m, or a spontaneous action that has been waiting (i.e. has been enabled without being performed) at least as long as m. But there can only be finitely many such messages or spontaneous actions, since there are only finitely many configurations {E.cnf(j) | j ≤ k}, and each such configuration has only finitely many receivable messages and enabled spontaneous actions, by the last condition in Definition 3; thus we have a contradiction. Now, consider that a spontaneous action is neglected. We get a contradiction by the same reasoning.

Independence. The notion of independence of actions and schedules is also often useful.
We can define independence for general labeled transition systems as follows. For protocols, actions performed by different nodes are independent. This is because executing an action for process p can only remove messages destined for p from the message pool; it can thus not disable any actions on any other process. Actions by different processes always commute, because their effects on the local state target local states of different processes, and their effects on the message pool commute. We call two schedules s, s' ∈ Act* independent if for all a ∈ s and a' ∈ s', a and a' are independent. Note that if two schedules s, s' are independent and possible in some configuration c, then post(c, s · s') = post(c, s' · s). Visually, this can be seen by doing a typical tiling argument.

Failures. As we probably all know from experience, failures are common in distributed systems. Failures can originate in the transport layer (a logical abstraction of the network, including switches, links, proxies, etc.) or the nodes (computers running the protocol software). Sometimes, the distinction is not that clear (for example, messages that are waiting in buffers are conceptually in the transport layer, but are subject to loss if the node fails). We now show how, given a protocol Φ and its LTS as defined in Sect. 2.2, Definition 3, we can model failures by adding failure actions to the LTS defined in Definition 4.

Modeling Transport Failures. Failures for message delivery often include (1) reordering, (2) loss, (3) duplication, and (4) injection of messages. In our protocol model, reorderings are already allowed, thus we do not consider them to be a failure.
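The commuting property stated above, post(c, s · s') = post(c, s' · s) for independent schedules, can be illustrated on a toy configuration where the actions of two processes touch disjoint parts of the state (a hypothetical example, not the tiling argument itself):

```python
# Independent actions touch disjoint parts of the configuration, so
# performing them in either order reaches the same configuration.

def act_p(cfg):
    """Action of process p: updates only p's local state."""
    return {**cfg, "p": cfg["p"] + 1}

def act_q(cfg):
    """Action of process q: updates only q's local state."""
    return {**cfg, "q": cfg["q"] * 2}

c = {"p": 0, "q": 3}
# The two one-action schedules commute: post(c, p.q) == post(c, q.p).
assert act_q(act_p(c)) == act_p(act_q(c))
```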
To model message loss, we can add a loss action to the LTS. Similarly, we can add an action for message duplication, and we can also model injection of arbitrary messages. However, we will not talk more about the latter, which is considered a byzantine failure, and which opens up a whole new category of challenges and results.

Masking Transport Failures. Protocols can mask message reordering, loss, and duplication by affixing sequence numbers to messages, and using send and receive buffers. Receivers can detect missing messages in the sequence and re-request them. In fact, socket protocols (such as TCP) use this type of mechanism (e.g. sliding windows).
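The sequence-number mechanism sketched above can be illustrated with a toy receiver; the class and method names are illustrative, not an actual socket API:

```python
class OrderedReceiver:
    """Masks duplication and reordering with sequence numbers: deliver
    messages in order, buffer out-of-order arrivals, drop duplicates,
    and report the next missing number so it can be re-requested."""

    def __init__(self):
        self.next_seq = 0
        self.buffer = {}         # out-of-order messages, seq -> payload
        self.delivered = []

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.buffer:
            return               # duplicate: already delivered or buffered
        self.buffer[seq] = payload
        while self.next_seq in self.buffer:      # deliver any ready prefix
            self.delivered.append(self.buffer.pop(self.next_seq))
            self.next_seq += 1

    def missing(self):
        """The smallest sequence number still owed, or None if caught up."""
        return self.next_seq if self.buffer else None


r = OrderedReceiver()
r.receive(1, "b"); r.receive(0, "a"); r.receive(1, "b")  # reorder + duplicate
assert r.delivered == ["a", "b"]
r.receive(3, "d")
assert r.missing() == 2          # message 2 was lost: re-request it
```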
CheckFence: Checking Consistency of Concurrent Data Types on Relaxed Memory Models
Concurrency libraries can facilitate the development of multithreaded programs by providing concurrent implementations of familiar data types such as queues or sets. There exist many optimized algorithms that can achieve superior performance on multiprocessors by allowing concurrent data accesses without using locks. Unfortunately, such algorithms can harbor subtle concurrency bugs. Moreover, they require memory ordering fences to function correctly on relaxed memory models. To address these difficulties, we propose a verification approach that can exhaustively check all concurrent executions of a given test program on a relaxed memory model and can verify that they are observationally equivalent to a sequential execution. Our CheckFence prototype automatically translates the C implementation code and the test program into a SAT formula, hands the latter to a standard SAT solver, and constructs counterexample traces if there exist incorrect executions. Applying CheckFence to five previously published algorithms, we were able to (1) find several bugs (some not previously known), and (2) determine how to place memory ordering fences for relaxed memory models.
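The correctness criterion in the abstract, that every concurrent execution is observationally equivalent to some sequential execution, can be sketched in miniature for atomic operations. CheckFence itself handles relaxed memory via SAT encoding; this toy instead enumerates interleavings directly, on an illustrative queue:

```python
def interleavings(t1, t2):
    """All interleavings of two threads' operation sequences."""
    if not t1:
        yield t2
        return
    if not t2:
        yield t1
        return
    for rest in interleavings(t1[1:], t2):
        yield [t1[0]] + rest
    for rest in interleavings(t1, t2[1:]):
        yield [t2[0]] + rest

def run(ops):
    """Execute operations against a shared FIFO queue; record dequeues."""
    q, observed = [], []
    for op, arg in ops:
        if op == "enq":
            q.append(arg)
        else:                                    # "deq"
            observed.append(q.pop(0) if q else None)
    return tuple(observed)

t1 = [("enq", 1), ("enq", 2)]
t2 = [("deq", None)]
# The sequential executions: t1 before t2, or t2 before t1.
sequential = {run(t1 + t2), run(t2 + t1)}
# Every interleaved outcome must match some sequential outcome.
concurrent = {run(list(i)) for i in interleavings(t1, t2)}
assert concurrent <= sequential
```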
Multicore Acceleration for Priority Based Schedulers for Concurrency Bug Detection
Testing multithreaded programs is difficult as threads can interleave in a nondeterministic fashion. Untested interleavings can cause failures, but testing all interleavings is infeasible. Many interleaving exploration strategies for bug detection have been proposed, but their relative effectiveness and performance remains unclear as they often lack publicly available implementations and have not been evaluated using common benchmarks. We describe NeedlePoint, an open-source framework that allows selection and comparison of a wide range of interleaving exploration policies for bug detection proposed by prior work. Our experience with NeedlePoint indicates that priority-based probabilistic concurrency testing (the PCT algorithm) finds bugs quickly, but it runs only one thread at a time, which destroys parallelism by serializing executions. To address this problem we propose a parallel version of the PCT algorithm (PPCT). We show that the new algorithm outperforms the original by a factor of 5x when testing parallel programs on an eight-core machine. We formally prove that parallel PCT provides the same probabilistic coverage guarantees as PCT. Moreover, PPCT is the first algorithm that runs multiple threads while providing coverage guarantees.
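The PCT idea the abstract builds on can be sketched as follows. This is a toy rendering of the published scheme (random thread priorities plus d-1 random priority-change points), not the NeedlePoint or PPCT implementation:

```python
import random

def pct_schedule(threads, steps, d, rng):
    """Toy sketch of PCT scheduling: give each thread a random high
    priority, pick d-1 random steps as priority-change points, and at
    each step run the thread with the highest current priority."""
    # Random initial priorities, all above the change-point priorities.
    prio = {t: d + i for i, t in enumerate(rng.sample(threads, len(threads)))}
    change_points = rng.sample(range(steps), d - 1)
    schedule = []
    for step in range(steps):
        t = max(threads, key=lambda th: prio[th])   # highest priority runs
        schedule.append(t)
        if step in change_points:
            # Demote the running thread below all initial priorities.
            prio[t] = d - 1 - change_points.index(step)
    return schedule

rng = random.Random(0)
sched = pct_schedule(["t0", "t1"], steps=4, d=2, rng=rng)
assert len(sched) == 4 and set(sched) <= {"t0", "t1"}
```

With d priority change points, PCT guarantees that any bug of "depth" d is found with probability at least 1/(n * k^(d-1)) for n threads and k steps, which is the coverage guarantee the abstract says PPCT preserves.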
TouchDevelop: Create Rich Mobile Apps on Touch Devices
ABSTRACT We are experiencing a technology shift: powerful and easy-to-use mobile devices like smartphones and tablets are becoming more prevalent than traditional PCs and laptops. Mobile devices are going to be the first and, in less developed countries, possibly the only computing devices which virtually all people will own and carry with them at all times. In this tutorial, participants will learn about developing software directly on their mobile devices. The tutorial is based on TouchDevelop, a modern software development environment that embraces this new reality, treating mobile devices as first-class software development machines, instead of relying on legacy development models built around the PC. TouchDevelop comes with a typed, structured programming language that is built around the idea of only using a touchscreen as the input device to author code. Access to the cloud, flexible user interfaces, and access to sensors such as the accelerometer and GPS are available as first-class citizens in the programming language. TouchDevelop is available as a web app on Windows tablets, iOS, Android, Windows PCs and Macs, and as a native app on Windows Phone. Keywords: mobile devices, Web IDE, smart phone, tablet, touch-based entry. MOBILESoft '14, June 2-3, 2014, Hyderabad, India. Copyright 2014 ACM 978-1-4503-2878-4/14/06. OBJECTIVES The goal of this tutorial is to show mobile software engineering researchers how software authoring for mobile devices can evolve beyond PC-centric legacy approaches.
The tutorial embraces a new mobile-device-centric concept around a cloud-based modern programming language and programming environment. Participants will gain the following skills and knowledge:
• how to use mobile devices to create mobile applications, without the use of a PC
• a simplified approach to designing user interfaces and interacting with cloud-based data
• how to easily share applications, and communicate with users
• how to gather and analyze crowd-sourced insights data about one's applications
• how to deal with a heterogeneous environment of devices and operating systems
Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory
We propose an axiomatic generic framework for modelling weak memory. We show how to instantiate this framework for SC, TSO, C++ restricted to release-acquire atomics, and Power. For Power, we compare our model to a preceding operational model in which we found a flaw. To do so, we define an operational model that we show equivalent to our axiomatic model. We also propose a model for ARM. Our testing on this architecture revealed a behaviour later acknowledged as a bug by ARM, and, more recently, 31 additional anomalies. We offer a new simulation tool, called herd, which allows the user to specify the model of his choice in a concise way. Given a specification of a model, the tool becomes a simulator for that model. The tool relies on an axiomatic description; this choice allows us to outperform all previous simulation tools. Additionally, we confirm that verification time is vastly improved, in the case of bounded model checking. Finally, we put our models in perspective, in the light of empirical data obtained by analysing the C and C++ code of a Debian Linux distribution. We present our new analysis tool, called mole, which explores a piece of code to find the weak memory idioms that it uses.