The Impact of RDMA on Agreement
Remote Direct Memory Access (RDMA) is becoming widely available in data
centers. This technology allows a process to directly read and write the memory
of a remote host, with a mechanism to control access permissions. In this
paper, we study the fundamental power of these capabilities. We consider the
well-known problem of achieving consensus despite failures, and find that RDMA
can improve the inherent trade-off in distributed computing between failure
resilience and performance. Specifically, we show that RDMA allows algorithms
that simultaneously achieve high resilience and high performance, while
traditional algorithms had to choose one or the other. With Byzantine failures,
we give an algorithm that only requires processes (where
is the maximum number of faulty processes) and decides in two (network)
delays in common executions. With crash failures, we give an algorithm that
only requires processes and also decides in two delays. Both
algorithms tolerate a minority of memory failures inherent to RDMA, and they
provide safety in asynchronous systems and liveness with standard additional
assumptions.
Comment: Full version of PODC'19 paper, strengthened broadcast algorith
Exploiting non-constant safe memory in resilient algorithms and data structures
We extend the Faulty RAM model by Finocchi and Italiano (2008) by adding a
safe memory of arbitrary size , and we then derive tradeoffs between the
performance of resilient algorithmic techniques and the size of the safe
memory. Let and denote, respectively, the maximum amount of
faults which can happen during the execution of an algorithm and the actual
number of occurred faults, with . We propose a resilient
algorithm for sorting entries which requires time and uses safe memory words. Our
algorithm outperforms previous resilient sorting algorithms which do not
exploit the available safe memory and require time. Finally, we exploit our sorting algorithm for
deriving a resilient priority queue. Our implementation uses safe
memory words and faulty memory words for storing keys, and
requires amortized time for each insert and
deletemin operation. Our resilient priority queue improves the amortized time required by the state of the art.
Comment: To appear in Theoretical Computer Science, 201
Havens: Explicit Reliable Memory Regions for HPC Applications
Supporting error resilience in future exascale-class supercomputing systems
is a critical challenge. Due to transistor scaling trends and increasing memory
density, scientific simulations are expected to experience more interruptions
caused by transient errors in the system memory. Existing hardware-based
detection and recovery techniques will be inadequate to manage the presence of
high memory fault rates.
In this paper we propose a partial memory protection scheme based on
region-based memory management. We define the concept of regions called havens
that provide fault protection for program objects. We provide reliability for
the regions through a software-based parity protection mechanism. Our approach
enables critical program objects to be placed in these havens. The fault
coverage provided by our approach is application agnostic, unlike
algorithm-based fault tolerance techniques.
Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16),
September 2016, Waltham, MA, US
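As a rough illustration of what software-based parity protection over a memory region can look like, the sketch below keeps one XOR parity word for a small region of integer words. It is not the havens implementation: the abstract does not describe the mechanism in detail, and the class name Haven and the methods write, is_consistent, and recover are hypothetical names chosen for this example. It assumes at most one word is corrupted and that its position is known before recovery.

```python
from functools import reduce
from operator import xor


class Haven:
    """Hypothetical protected region: integer words plus one XOR parity word."""

    def __init__(self, words):
        self.words = list(words)
        # Parity is the XOR of every word in the region.
        self.parity = reduce(xor, self.words, 0)

    def write(self, index, value):
        # Incremental update: XOR out the old word, XOR in the new one.
        self.parity ^= self.words[index] ^ value
        self.words[index] = value

    def is_consistent(self):
        # A mismatch means some word (or the parity itself) was corrupted.
        return reduce(xor, self.words, 0) == self.parity

    def recover(self, index):
        # Rebuild one word known to be bad from the parity and the other words.
        others = reduce(
            xor, (w for i, w in enumerate(self.words) if i != index), 0
        )
        self.words[index] = self.parity ^ others
        return self.words[index]


region = Haven([3, 5, 9])
region.write(1, 7)       # normal write keeps parity in sync
region.words[2] ^= 0x10  # simulate a transient bit flip in memory
damaged = not region.is_consistent()
restored = region.recover(2)
```

A single parity word can only detect that something changed; locating the bad word needs extra information, such as a per-word checksum, which this sketch leaves out.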