64 research outputs found

    Reconfigurable Atomic Transaction Commit (Extended Version)

    Modern data stores achieve scalability by partitioning data into shards and fault tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require 2f+1 crash-stop replicas per shard to tolerate f failures. In this paper we present atomic commit protocols that require only f+1 replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models: the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latter's CPU. Our protocols are inspired by a recent FARM system for RDMA-based transaction processing. Our work codifies the core ideas of FARM as distributed TCS protocols, rigorously proves them correct and highlights the trade-offs required by the use of RDMA. Comment: Extended version of the PODC '19 paper: Reconfigurable Atomic Transaction Commi
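The replica counts quoted in the abstract above (2f+1 per shard for existing protocols, f+1 with an external reconfiguration service) can be compared with a short calculation. This is only an arithmetic sketch: the function names and the `shards` parameter are hypothetical and do not come from the paper.

```python
def replicas_per_shard(f: int, reconfigurable: bool) -> int:
    """Replicas one shard needs in order to tolerate f crash failures."""
    if f < 0:
        raise ValueError("f must be non-negative")
    # 2f+1 for the existing protocols mentioned in the abstract,
    # f+1 for protocols that rely on an external reconfiguration service.
    return f + 1 if reconfigurable else 2 * f + 1


def total_replicas(shards: int, f: int, reconfigurable: bool) -> int:
    """Total servers across all shards (hypothetical helper)."""
    return shards * replicas_per_shard(f, reconfigurable)


if __name__ == "__main__":
    for f in (1, 2, 3):
        print(f, total_replicas(10, f, False), total_replicas(10, f, True))
```

For example, with 10 shards and f = 2, the first scheme needs 50 servers and the second needs 30.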

    Improving the benefits of multicast prioritization algorithms

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1087-z. Prioritized atomic multicast consists in delivering messages in total order while ensuring that message priorities are respected; i.e., messages with higher priorities are delivered first. That service can be used in multiple applications. One example is the use of prioritization algorithms to reduce transaction abort rates in applications that use a replicated database system. To this end, transaction messages are assigned priorities according to their probability of violating the existing integrity constraints. This paper evaluates how that abort reduction may be improved by varying the message sending rate and the bounds set on the length of the priority reordering queue used by those multicast algorithms. This work has been partially supported by EU FEDER and Spanish MICINN under research Grants TIN2009-14460-C03-01 and TIN2010-17193. Miedes De Elías, EP.; Muñoz Escoí, FD. (2014). Improving the benefits of multicast prioritization algorithms. Journal of Supercomputing. 68(3):1280-1301. doi:10.1007/s11227-014-1087-z
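To make the idea of a bounded priority reordering queue concrete, here is a minimal sketch. It is not the algorithm evaluated in the paper: the class name, the `bound` parameter, and the release rule (deliver the highest-priority held message whenever the queue exceeds its bound) are all assumptions chosen for illustration. Because the rule is deterministic and its input is a totally ordered stream, every replica running it would release messages in the same order.

```python
import heapq
from typing import Any, List, Tuple


class BoundedPriorityReorderer:
    """Hypothetical reordering stage placed after a total-order multicast."""

    def __init__(self, bound: int) -> None:
        if bound < 1:
            raise ValueError("bound must be at least 1")
        self.bound = bound
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = 0  # arrival counter; breaks ties between equal priorities

    def on_total_order_delivery(self, priority: int, payload: Any) -> List[Any]:
        """Accept one totally ordered message; return messages released now."""
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._seq, payload))
        self._seq += 1
        released = []
        while len(self._heap) > self.bound:
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self) -> List[Any]:
        """Release everything still held, highest priority first."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released


def demo() -> List[Any]:
    reorderer = BoundedPriorityReorderer(bound=2)
    out: List[Any] = []
    for priority, msg in [(1, "a"), (5, "b"), (3, "c"), (9, "d")]:
        out += reorderer.on_total_order_delivery(priority, msg)
    out += reorderer.flush()
    return out
```

In this sketch a larger `bound` gives high-priority messages more chances to overtake earlier ones, at the cost of extra delay for low-priority messages such as `"a"`, which is held until the final flush.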

    Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime

    As HPC systems and applications get bigger and more complex, we are approaching an era in which resiliency and run-time elasticity concerns become paramount. We offer a building block for an alternative resiliency approach in which computations are able to make progress while components fail, and in which the set of nodes can change dynamically throughout the lifetime of a computation. The core of our solution is a hierarchical, scalable membership service providing eventual consistency semantics. An attribute replication service is used for hierarchy organization and is exposed to external applications. Our solution is based on P2P technologies and provides resiliency and elastic runtime support at ultra-large scales. The resulting middleware is general-purpose while exploiting the unique features and architecture of HPC platforms. We have implemented and tested this system on BlueGene/P with Linux and, using worst-case analysis, evaluated the service scalability as effective for up to 1M nodes.

    Scalable transactions in the cloud: partitioning revisited

    Lecture Notes in Computer Science, 6427. Cloud computing is becoming one of the most used paradigms to deploy highly available and scalable systems. These systems usually demand the management of huge amounts of data, which cannot be handled by traditional or replicated database systems as we know them. Recent solutions store data in special key-value structures, an approach that commonly lacks the consistency provided by transactional guarantees, as it is traded for high scalability and availability. In order to ensure consistent access to the information, the use of transactions is required. However, it is well known that traditional replication protocols do not scale well in a cloud environment. Here we take a look at current proposals to deploy transactional systems in the cloud and we propose a new system that aims to be a step forward in achieving this goal. We then focus on data partitioning and describe the key role it plays in achieving high scalability. This work has been partially supported by the Spanish Government under grant TIN2009-14460-C03-02, by the Spanish MEC under grant BES-2007-17362, and by project ReD Resilient Database Clusters (PDTC/EIA-EIA/109044/2008).

    Dynamic Group Diffie-Hellman Key Exchange under Standard Assumptions

    Authenticated Diffie-Hellman key exchange allows two principals communicating over a public network, each holding public/private keys, to agree on a shared secret value. In this paper we study the natural extension of this cryptographic problem to a group of principals. We begin from existing formal security models and refine them to incorporate major missing details (e.g., strong corruption and concurrent sessions). Within this model we define the execution of a protocol for authenticated dynamic group Diffie-Hellman and show that it is provably secure under the decisional Diffie-Hellman assumption. Our security result holds in the standard model and thus provides better security guarantees than previously published results in the random oracle model.

    On Correctness of Data Structures under Reads-Write Concurrency

    Abstract. We study the correctness of shared data structures under reads-write concurrency. A popular approach to ensuring correctness of read-only operations in the presence of concurrent updates is read-set validation, which checks that none of the read variables has changed since it was first read. In practice, this approach is often too conservative, which adversely affects performance. In this paper, we introduce a new framework for reasoning about correctness of data structures under reads-write concurrency, which replaces validation of the entire read-set with more general criteria. Namely, instead of verifying that all read shared variables still hold the values read from them, we verify abstract conditions over the shared variables, which we call base conditions. We show that reading values that satisfy some base condition at every point in time implies correctness of read-only operations executing in parallel with updates. Somewhat surprisingly, the resulting correctness guarantee is not equivalent to linearizability, and is instead captured through two new conditions: validity and regularity. Roughly speaking, the former requires that a read-only operation never reaches a state unreachable in a sequential execution; the latter generalizes Lamport’s notion of regularity for arbitrary data structures, and is weaker than linearizability. We further extend our framework to capture linearizability as well. We illustrate how our framework can be applied for reasoning about correctness of a variety of implementations of data structures such as linked lists.
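The read-set validation baseline that the abstract above describes can be sketched as follows. This is a single-threaded illustration only, with hypothetical names (`VersionedStore`, `ReadOnlyOperation`); it uses per-variable version counters as one common way to detect change, and it does not implement the paper's base conditions. A real concurrent implementation would also need each value and its version to be read atomically.

```python
from typing import Any, Dict, Tuple


class VersionedStore:
    """Shared variables, each paired with a version bumped on every write."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[Any, int]] = {}

    def write(self, key: str, value: Any) -> None:
        _, version = self._data.get(key, (None, 0))
        self._data[key] = (value, version + 1)

    def read(self, key: str) -> Tuple[Any, int]:
        return self._data.get(key, (None, 0))


class ReadOnlyOperation:
    """Records the version seen at the first read of each variable."""

    def __init__(self, store: VersionedStore) -> None:
        self.store = store
        self.read_set: Dict[str, int] = {}

    def read(self, key: str) -> Any:
        value, version = self.store.read(key)
        # Keep the version from the first read of this key only.
        self.read_set.setdefault(key, version)
        return value

    def validate(self) -> bool:
        """True if no variable in the read set has changed since first read."""
        return all(
            self.store.read(key)[1] == version
            for key, version in self.read_set.items()
        )
```

A read-only operation would call `validate()` before returning and retry if it fails. The check rejects any change to any variable read, including changes that would not affect the result, which is the conservatism the abstract refers to.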

    On the Efficiency of Atomic Multi-reader, Multi-writer Distributed Memory

    This paper considers quorum-replicated, multi-writer, multi-reader (MWMR) implementations of survivable atomic registers in a distributed message-passing system with processors prone to failures. Previous implementations in such settings invariably required two rounds of communication between readers/writers and replica owners. Hence the question arises whether it is possible to have single-round read and/or write operations in this setting. As a first step, we present an algorithm, called CWFR, that allows the classic two-round write operations, while supporting single-round read operations. Since multiple write operations may be concurrent with a read operation, this algorithm involves an iterative (local) discovery of the latest completed write operation. This algorithm precipitates the question of whether fast (single-round) writes may co-exist with fast reads. We thus devise a second algorithm, called SFW, that exploits a new technique called server side ordering (SSO), which, unlike previous approaches, places partial responsibility for the ordering of write operations on the replica owners (the servers). With SSO, fast write operations are introduced for the very first time in the MWMR setting. While this is possible, we show that under certain conditions the MWMR model imposes inherent limitations on any quorum-based fast write implementation of a safe read/write register and potentiall

    On the Performance of a CORBA Caching Service over the Wide Internet

    No full text