The Impact of RDMA on Agreement

Author: Aguilera Marcos K.
Ben-David Naama
Guerraoui Rachid
Marathe Virendra
Zablotchi Igor
Publication date: 03/03/2020

Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or another. With Byzantine failures, we give an algorithm that only requires $n \geq 2f_P + 1$ processes (where $f_P$ is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires $n \geq f_P + 1$ processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions. Comment: Full version of PODC'19 paper, strengthened broadcast algorith
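
As a rough illustration of the process-count bounds quoted in this abstract, the sketch below computes the smallest number of processes for a given number of faulty processes. The function names are hypothetical and do not come from the paper. The "classical" figures are the well-known message-passing requirements (a correct majority for crash failures, more than two thirds correct for Byzantine failures), included only for comparison.

```python
def min_processes_rdma(f_p: int, byzantine: bool) -> int:
    """Smallest n satisfying the bounds stated in the abstract.

    Byzantine failures: n >= 2*f_p + 1
    Crash failures:     n >= f_p + 1
    """
    if f_p < 0:
        raise ValueError("f_p must be non-negative")
    return 2 * f_p + 1 if byzantine else f_p + 1


def min_processes_classical(f_p: int, byzantine: bool) -> int:
    """Well-known message-passing bounds, for comparison only (not from the abstract).

    Byzantine failures: n >= 3*f_p + 1
    Crash failures:     n >= 2*f_p + 1
    """
    if f_p < 0:
        raise ValueError("f_p must be non-negative")
    return 3 * f_p + 1 if byzantine else 2 * f_p + 1


if __name__ == "__main__":
    for f_p in (1, 2, 3):
        print(
            f_p,
            min_processes_rdma(f_p, byzantine=True),
            min_processes_classical(f_p, byzantine=True),
            min_processes_rdma(f_p, byzantine=False),
            min_processes_classical(f_p, byzantine=False),
        )
```

The sketch only evaluates the inequalities. It says nothing about the memory-failure assumptions or the two-delay decision path that the abstract also describes.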

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

The Hipeac Vision, 2010

Author: Cohen Albert
De Bosschere Koen
De Sutter Bjorn
Duranton Marc
Falsafi Babak
Gaydadjiev Georgi
Katevenis Manolis
Maebe Jonas
Munk Harm
Navarro Nacho
Ramirez Alex
Temam Olivier
Valero Mateo
Yehia Sami
Publication venue: HiPEAC
Publication date: 01/01/2010

Ghent University Academic Bibliography

Archivsystem Ask23

Exploiting non-constant safe memory in resilient algorithms and data structures

Author: De Stefani Lorenzo
Silvestri Francesco
Publication venue: Elsevier BV
Publication date: 01/01/2015

We extend the Faulty RAM model by Finocchi and Italiano (2008) by adding a safe memory of arbitrary size $S$, and we then derive tradeoffs between the performance of resilient algorithmic techniques and the size of the safe memory. Let $\delta$ and $\alpha$ denote, respectively, the maximum amount of faults which can happen during the execution of an algorithm and the actual number of occurred faults, with $\alpha \leq \delta$. We propose a resilient algorithm for sorting $n$ entries which requires $O\left(n\log n+\alpha (\delta/S + \log S)\right)$ time and uses $\Theta(S)$ safe memory words. Our algorithm outperforms previous resilient sorting algorithms which do not exploit the available safe memory and require $O\left(n\log n+ \alpha\delta\right)$ time. Finally, we exploit our sorting algorithm for deriving a resilient priority queue. Our implementation uses $\Theta(S)$ safe memory words and $\Theta(n)$ faulty memory words for storing $n$ keys, and requires $O\left(\log n + \delta/S\right)$ amortized time for each insert and deletemin operation. Our resilient priority queue improves the $O\left(\log n + \delta\right)$ amortized time required by the state of the art. Comment: To appear in Theoretical Computer Science, 201
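
To make the tradeoff in this abstract concrete, the bounds can be evaluated at two sizes of safe memory. This is only a substitution into the stated formulas. It assumes the bounds hold at these particular values of $S$ and that $\delta \geq 2$.

```latex
\begin{align*}
\text{Sorting, } S = \Theta(1):\quad
  & O\left(n\log n + \alpha(\delta/S + \log S)\right) = O\left(n\log n + \alpha\delta\right) \\
\text{Sorting, } S = \Theta(\delta):\quad
  & O\left(n\log n + \alpha(\delta/S + \log S)\right) = O\left(n\log n + \alpha\log\delta\right) \\
\text{Priority queue, } S = \Theta(1):\quad
  & O\left(\log n + \delta/S\right) = O\left(\log n + \delta\right) \\
\text{Priority queue, } S = \Theta(\delta):\quad
  & O\left(\log n + \delta/S\right) = O\left(\log n\right)
\end{align*}
```

With constant safe memory the expressions reduce to the earlier bounds quoted in the abstract. With safe memory proportional to $\delta$, the fault-dependent term shrinks from $\alpha\delta$ to $\alpha\log\delta$ for sorting and disappears from the priority queue bound.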

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Padova

Havens: Explicit Reliable Memory Regions for HPC Applications

Author: Engelmann Christian
Hukerikar Saurabh
Publication date: 26/10/2016

Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques. Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16), September 2016, Waltham, MA, US
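
The abstract mentions a software-based parity protection mechanism but gives no detail. The following is a minimal sketch of the general idea of XOR parity over a region of integer words. It is not the authors' implementation. The class `ParityRegion` and its methods are hypothetical names chosen for illustration.

```python
from functools import reduce
from operator import xor


class ParityRegion:
    """Hypothetical memory region that keeps one XOR parity word over its contents."""

    def __init__(self, words):
        self.words = list(words)
        # Parity word is the XOR of every word currently in the region.
        self.parity = reduce(xor, self.words, 0)

    def write(self, index, value):
        # Incremental update: XOR out the old word, XOR in the new one.
        self.parity ^= self.words[index] ^ value
        self.words[index] = value

    def is_consistent(self):
        # Recompute the XOR of all words and compare with the stored parity.
        return reduce(xor, self.words, 0) == self.parity


if __name__ == "__main__":
    region = ParityRegion([0x10, 0x20, 0x30])
    region.write(1, 0x2A)           # legitimate write keeps parity in sync
    print(region.is_consistent())

    region.words[0] ^= 0x04         # simulate a bit flip that bypasses write()
    print(region.is_consistent())
```

A single parity word of this kind detects a corruption that changes one word in the region. It does not locate the faulty word, and it can miss multiple corruptions that cancel each other out.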

arXiv.org e-Print Archive

Crossref