Search CORE

1,469 research outputs found

Lazy Fault Recovery for Redundant Mpi

Author: Saliba Elie
Publication venue: DigitalCommons@CalPoly
Publication date: 01/06/2019
Field of study

Distributed Systems (DS) where multiple computers share a workload across a network, are used everywhere, from data intensive computations to storage and machine learning. DS provide a relatively cheap and efficient solution that allows stability with improved performance for computational intensive applications. In a DS faults and failures are the norm not the exception. At any moment data corruption can occur especially since a DS usually consists of hundred to thousands of units of commodity hardware. The large number and quality of components guarantees, by probability, that at any given time some of the components will not be working and some of them will not recover from failure. DS can experience problems caused by application bugs, operating systems bugs, failures with disks, memory, connectors, networking, power supply, and other components; therefore, constant monitoring and failure detection are fundamental. Automatic recovery must be integral to the system. One of the most commonly used programming languages for DS is Message Passing Interface (MPI). Unfortunately MPI does not support fault detection or recovery. In this thesis, we build a recovery mechanism based on replicas that works on top of the asynchronous fault detection implemented in previous work. Results shows that our recovery implementation is successful and the overhead in execution time is minimal

DigitalCommons@CalPoly

Towards bounded wait-free PASIS

Author: Abd-El-Malek Michael
Ganger Gregory R.
Goodson Garth R.
Reiter Michael K.
Wylie Jay J.
Publication venue: Dagstuhl Seminar Proceedings. 06371 - From Security to Dependability
Publication date: 01/01/2007
Field of study

The PASIS read/write protocol implements a Byzantine fault-tolerant erasure-coded atomic register. The prototype PASIS storage system implementation provides excellent best-case performance. Writes require two round trips and contention- and failure-free reads require one. Unfortunately, even though writes and reads are wait-free in PASIS, Byzantine components can induce correct clients to perform an unbounded amount of work. In this extended abstract, we enumerate the avenues by which Byzantine servers and clients can induce correct clients to perform an unbounded amount of work in PASIS. We sketch extensions to the PASIS protocol and Lazy Verification that bound the amount of work Byzantine components can induce correct clients to perform. We believe that the extensions provide bounded wait-free reads and writes. We also believe that an implementation that incorporates these extensions will preserve the excellent best-case performance of the original PASIS prototype

Dagstuhl Research Online Publication Server

Formalisms for program reification and fault tolerance

Author: Coenen J.A.A.
Publication venue: Technische Universiteit Eindhoven
Publication date: 01/01/1994
Field of study

Repository TU/e

Pure OAI Repository

Lazy Fault Detection for Redundant MPI

Author: Ford Corey
Publication venue: DigitalCommons@CalPoly
Publication date: 01/06/2016
Field of study

As the scale of supercomputers grows, it is becoming increasingly important for software to efficiently withstand hardware and software faults. Process replication is one resilience technique, but typical implementations require replicas to stay closely synchronized with each other. We propose algorithms to lazily detect faults in replicated MPI applications, allowing for more flexibility in replica scheduling and potential power savings. Evaluation shows that, when all processes are operated at full power, this approach allows applications to complete substantially faster as compared to using a synchronized model, and often as fast as in non-replicated execution

DigitalCommons@CalPoly

State of The Art and Hot Aspects in Cloud Data Storage Security

Author: Gehrmann Christian
Morenius Fredric
Paladi Nicolae
Publication venue: Swedish Institute of Computer Science
Publication date: 01/01/2013
Field of study

Along with the evolution of cloud computing and cloud storage towards matu- rity, researchers have analyzed an increasing range of cloud computing security aspects, data security being an important topic in this area. In this paper, we examine the state of the art in cloud storage security through an overview of selected peer reviewed publications. We address the question of defining cloud storage security and its different aspects, as well as enumerate the main vec- tors of attack on cloud storage. The reviewed papers present techniques for key management and controlled disclosure of encrypted data in cloud storage, while novel ideas regarding secure operations on encrypted data and methods for pro- tection of data in fully virtualized environments provide a glimpse of the toolbox available for securing cloud storage. Finally, new challenges such as emergent government regulation call for solutions to problems that did not receive enough attention in earlier stages of cloud computing, such as for example geographical location of data. The methods presented in the papers selected for this review represent only a small fraction of the wide research effort within cloud storage security. Nevertheless, they serve as an indication of the diversity of problems that are being addressed

CiteSeerX

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

How to tell if your cloud files are vulnerable to drive crashes

Author: Alina Oprea
Ari Juels
Citable Link
Kevin D. Bowers
Marten Van Dijk
Ronald L. Rivest
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

This paper presents a new challenge--verifying that a remote server is storing a file in a fault-tolerant manner, i.e., such that it can survive hard-drive failures. We describe an approach called the Remote Assessment of Fault Tolerance (RAFT). The key technique in a RAFT is to measure the time taken for a server to respond to a read request for a collection of file blocks. The larger the number of hard drives across which a file is distributed, the faster the read-request response. Erasure codes also play an important role in our solution. We describe a theoretical framework for RAFTs and offer experimental evidence that RAFTs can work in practice in several settings of interest

CiteSeerX

DSpace@MIT

Crossref

Cryptology ePrint Archive

Synchronous Counting and Computational Algorithm Design

Author: Dolev Danny
Heljanko Keijo
Järvisalo Matti
Korhonen Janne H.
Lenzen Christoph
Rybicki Joel
Suomela Jukka
Wieringa Siert
Publication venue: 'Elsevier BV'
Publication date: 05/01/2015
Field of study

Consider a complete communication network on

n

nodes, each of which is a state machine. In synchronous 2-counting, the nodes receive a common clock pulse and they have to agree on which pulses are "odd" and which are "even". We require that the solution is self-stabilising (reaching the correct operation from any initial state) and it tolerates

f

Byzantine failures (nodes that send arbitrary misinformation). Prior algorithms are expensive to implement in hardware: they require a source of random bits or a large number of states. This work consists of two parts. In the first part, we use computational techniques (often known as synthesis) to construct very compact deterministic algorithms for the first non-trivial case of

f = 1

. While no algorithm exists for

n < 4

, we show that as few as 3 states per node are sufficient for all values

n \ge 4

. Moreover, the problem cannot be solved with only 2 states per node for

n = 4

, but there is a 2-state solution for all values

n \ge 6

. In the second part, we develop and compare two different approaches for synthesising synchronous counting algorithms. Both approaches are based on casting the synthesis problem as a propositional satisfiability (SAT) problem and employing modern SAT-solvers. The difference lies in how to solve the SAT problem: either in a direct fashion, or incrementally within a counter-example guided abstraction refinement loop. Empirical results suggest that the former technique is more efficient if we want to synthesise time-optimal algorithms, while the latter technique discovers non-optimal algorithms more quickly.Comment: 35 pages, extended and revised versio

arXiv.org e-Print Archive

MPG.PuRe

Reliable massively parallel symbolic computing : fault tolerance for a distributed Haskell

Author: Stewart Robert
Publication venue: Mathematical and Computer Sciences
Publication date: 01/01/2013
Field of study

As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures. Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism, and employs work stealing to load balance dynamically generated irregular task sizes. To investigate providing scalable fault tolerant symbolic computation we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures, they are instead handled by the scheduler. An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures, and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution, and three executions that recover from faults. The fault tolerant work stealing has been abstracted in to a Promela model. The SPIN model checker is used to exhaustively search the intersection of states in this automaton to validate a key resiliency property of the protocol. It asserts that an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures. The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville up to 1400 cores of a HPC architecture. Moreover, supervision overheads are consistently low scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results

CiteSeerX

ROS: The Research Output Service. Heriot-Watt University Edinburgh