
    The Impact of RDMA on Agreement

    Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or the other. With Byzantine failures, we give an algorithm that requires only $n \geq 2f_P + 1$ processes (where $f_P$ is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that requires only $n \geq f_P + 1$ processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness under standard additional assumptions.
    Comment: Full version of PODC'19 paper, strengthened broadcast algorithm
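
    The enabling mechanism here is RDMA's dynamic access permissions: a memory region can be re-registered so that only the current leader holds write access, shutting out slow or Byzantine writers. Below is a minimal Python sketch of that permission idea only, not the paper's protocol; RemoteRegion, grant_exclusive_write, and the process names are illustrative inventions.

        class RemoteRegion:
            """Toy model of an RDMA-exposed memory region: every process may
            read, but only processes holding write permission may write."""

            def __init__(self):
                self.slots = {}       # register name -> value
                self.writers = set()  # process ids currently allowed to write

            def grant_exclusive_write(self, pid):
                # Re-granting permission implicitly revokes the old writer,
                # which is what lets a new leader shut out its predecessor.
                self.writers = {pid}

            def write(self, pid, name, value):
                if pid not in self.writers:
                    raise PermissionError(f"process {pid} may not write")
                self.slots[name] = value

            def read(self, name):
                return self.slots.get(name)

        region = RemoteRegion()
        region.grant_exclusive_write("leader-1")
        region.write("leader-1", "proposal", 42)
        region.grant_exclusive_write("leader-2")      # revokes leader-1
        try:
            region.write("leader-1", "proposal", 7)   # Byzantine retry fails
        except PermissionError as e:
            print(e)
        print(region.read("proposal"))                # -> 42

    Intuitively, this is why the Byzantine bound can drop to $n \geq 2f_P + 1$: a faulty process cannot forge writes into memory it has no permission for, so fewer correct processes are needed to mask its behavior.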

    Exploiting non-constant safe memory in resilient algorithms and data structures

    We extend the Faulty RAM model of Finocchi and Italiano (2008) by adding a safe memory of arbitrary size $S$, and we derive trade-offs between the performance of resilient algorithmic techniques and the size of the safe memory. Let $\delta$ and $\alpha$ denote, respectively, the maximum number of faults that can occur during the execution of an algorithm and the actual number of faults that do occur, with $\alpha \leq \delta$. We propose a resilient algorithm for sorting $n$ entries that requires $O(n\log n + \alpha(\delta/S + \log S))$ time and uses $\Theta(S)$ safe memory words. Our algorithm outperforms previous resilient sorting algorithms, which do not exploit the available safe memory and require $O(n\log n + \alpha\delta)$ time. Finally, we exploit our sorting algorithm to derive a resilient priority queue. Our implementation uses $\Theta(S)$ safe memory words and $\Theta(n)$ faulty memory words to store $n$ keys, and requires $O(\log n + \delta/S)$ amortized time for each insert and deletemin operation. Our resilient priority queue improves on the $O(\log n + \delta)$ amortized time required by the state of the art.
    Comment: To appear in Theoretical Computer Science, 201
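
    For context, the baseline defence in the Faulty RAM model is replication with majority voting: a value kept as $2\delta + 1$ copies in faulty memory survives up to $\delta$ corruptions, at a $\Theta(\delta)$ space and time cost per reliable value. The sketch below shows that standard trick only (ReliableValue and DELTA are invented names, not the paper's algorithm); the abstract's bounds show how a safe memory of size $S$ amortizes this per-fault overhead down to $O(\delta/S + \log S)$.

        from collections import Counter

        DELTA = 2  # assumed upper bound on the number of memory faults

        class ReliableValue:
            """A value stored as 2*DELTA + 1 replicas in (simulated) faulty
            memory; a majority vote on read tolerates up to DELTA corrupted
            copies, since the true value always holds a strict majority."""

            def __init__(self, value):
                self.copies = [value] * (2 * DELTA + 1)

            def write(self, value):
                self.copies = [value] * (2 * DELTA + 1)

            def read(self):
                value, _count = Counter(self.copies).most_common(1)[0]
                return value

        v = ReliableValue(10)
        v.copies[0] = 999   # adversarial corruption of one replica
        v.copies[3] = -1    # ...and a second (at most DELTA faults total)
        assert v.read() == 10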

    Havens: Explicit Reliable Memory Regions for HPC Applications

    Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in system memory. Existing hardware-based detection and recovery techniques will be inadequate in the face of such high memory fault rates. In this paper, we propose a partial memory protection scheme based on region-based memory management. We define regions called havens that provide fault protection for program objects, and we provide reliability for these regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
    Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, USA
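
    The abstract does not spell out the parity mechanism, but a software parity scheme of this general kind can be sketched as follows. This is a toy model under assumptions not in the paper: Haven and its methods are hypothetical names, words are integers, and recovery assumes the index of the lost word is known (e.g. flagged by the memory system).

        from functools import reduce
        from operator import xor

        class Haven:
            """Toy model of a parity-protected region: one parity word holds
            the XOR of all data words and is maintained on every update."""

            def __init__(self, words):
                self.words = list(words)
                self.parity = reduce(xor, self.words, 0)

            def update(self, i, value):
                # Incremental maintenance: XOR out the old word, XOR in the new.
                self.parity ^= self.words[i] ^ value
                self.words[i] = value

            def recover(self, i):
                # Rebuild word i from the parity and the surviving words.
                survivors = (w for j, w in enumerate(self.words) if j != i)
                return self.parity ^ reduce(xor, survivors, 0)

        h = Haven([3, 7, 11, 13])
        h.update(2, 42)
        h.words[2] = 0        # simulate losing word 2 to a transient error
        print(h.recover(2))   # -> 42

    A single parity word per region recovers one lost word, and protecting only the objects placed in havens keeps the space and update overhead proportional to the region size rather than to the whole address space.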