
    Workrs: Fault Tolerant Horizontal Computation Offloading

    The broad development and usage of edge devices have highlighted the importance of creating resilient and computationally capable environments. When working with edge devices, these desiderata are usually achieved through replication and offloading. This paper reports on the design and implementation of Workrs, a fault-tolerant service that enables the offloading of jobs from devices with limited computational power. We propose a solution that allows users to upload jobs through a web service, to be executed on edge nodes within the system. The solution is designed to be fault tolerant and scalable, with no single point of failure and the ability to accommodate growth if the service is expanded. The use of Docker checkpointing on the worker machines ensures that jobs can be resumed in the event of a fault. We provide a mathematical approach to optimizing the number of checkpoints created along a computation, given that we can forecast the time needed to execute a job. We present experiments that indicate in which scenarios checkpointing benefits job execution. The results are based on a working prototype, which shows clear benefits of checkpoint and restore when jobs' completion time is long relative to the forecast fault rate. The code of Workrs is released as open source, and it is available at \url{https://github.com/orgs/P7-workrs/repositories}. This paper is an extended version of \cite{edge2023paper}.
    Comment: extended version of a paper accepted at IEEE Edge 2023
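
    The abstract does not reproduce the paper's checkpoint-optimization formula, so the sketch below falls back on Young's classic first-order approximation (optimal checkpoint interval close to sqrt(2 x checkpoint cost x mean time between failures)) to pick a checkpoint count from a forecast runtime. The cost model, function names, and example numbers are illustrative assumptions, not the paper's results.

```python
import math

def optimal_checkpoints(runtime, ckpt_cost, mtbf):
    """Pick the number of checkpoints that minimizes expected overhead.

    A rough first-order model (an assumption, not the paper's exact one):
    - n checkpoints split the job into n + 1 equal segments;
    - failures arrive at rate 1/mtbf, so we expect runtime/mtbf of them;
    - each failure loses, on average, half of the current segment.
    """
    def expected_overhead(n):
        segment = runtime / (n + 1)
        rework = (runtime / mtbf) * (segment / 2)  # expected recomputation
        return n * ckpt_cost + rework              # checkpoint cost + rework

    # Young's rule gives a good starting point: interval ~ sqrt(2 * C * MTBF).
    interval = math.sqrt(2 * ckpt_cost * mtbf)
    guess = max(0, round(runtime / interval) - 1)

    # Search the neighbourhood of the analytic guess for the integer optimum.
    candidates = range(max(0, guess - 2), guess + 3)
    return min(candidates, key=expected_overhead)

# Example: a 2-hour job, 30 s per checkpoint, one fault every 20 minutes.
print(optimal_checkpoints(runtime=7200, ckpt_cost=30, mtbf=1200))  # -> 26
```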

    Eventual fault recovery strategies for Byzantine failures

    Byzantine faults in distributed systems can have very destructive consequences for services built on top of these systems, but they are not commonly tolerated in production systems due to the overhead and scalability limitations of existing approaches such as Byzantine fault tolerance. This work describes a reactive protocol for recovering from Byzantine failures in replicated state machines. In contrast to traditional Byzantine fault tolerance (BFT), which attempts to mask faults, this protocol is designed to allow faults to be exposed to clients but ensures that no client can fork the state of the system, by rolling back faulty updates once they are detected. This ensures that, in spite of Byzantine failures, the system will always converge to a consistent state. The system provides a contract to the client, called lapse consistency, that bounds the number of inconsistent reads that can be experienced as a result of the rollbacks it performs. This system extends prior work on Byzantine detection to provide an integrated system that can not only eventually detect, but also respond to, Byzantine faults with provable consistency semantics, while preserving many of the important properties of Byzantine detection such as scalability and responsiveness. We evaluate the overhead of a proof-of-concept implementation of the system.
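
    As a rough illustration of the rollback contract described above (not the paper's actual protocol, whose detection layer and bound are more involved), the toy log below optimistically exposes updates, removes a faulty update once it is detected, and counts the reads served in the meantime against a fixed lapse bound. All names here are hypothetical.

```python
class LapseConsistentLog:
    """Toy log that exposes faults to clients but rolls them back.

    A hypothetical sketch: real detection would come from an
    accountability layer, and the real lapse-consistency bound is a
    protocol guarantee, not a runtime assertion.
    """

    def __init__(self, lapse_bound):
        self.log = []               # optimistically committed updates
        self.lapse_bound = lapse_bound
        self.inconsistent_reads = 0

    def apply(self, update):
        self.log.append(update)     # exposed to readers before validation

    def read(self):
        return list(self.log)       # may reflect not-yet-detected faults

    def rollback(self, bad_index, reads_served_since):
        """Drop a detected Byzantine update, keeping the honest suffix."""
        del self.log[bad_index]
        self.inconsistent_reads += reads_served_since
        # The contract: rollbacks never cause more than `lapse_bound`
        # inconsistent reads in total.
        assert self.inconsistent_reads <= self.lapse_bound

log = LapseConsistentLog(lapse_bound=10)
for u in ("a", "BAD", "c"):
    log.apply(u)
log.rollback(bad_index=1, reads_served_since=3)
print(log.read())  # -> ['a', 'c']: the state converges after the rollback
```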

    Reliable massively parallel symbolic computing: fault tolerance for a distributed Haskell

    As the number of cores in manycore systems grows exponentially, the number of failures is also predicted to grow exponentially. Hence massively parallel computations must be able to tolerate faults. Moreover, new approaches to language design and system architecture are needed to address the resilience of massively parallel heterogeneous architectures. Symbolic computation has underpinned key advances in Mathematics and Computer Science, for example in number theory, cryptography, and coding theory. Computer algebra software systems facilitate symbolic mathematics. Developing these at scale has its own distinctive set of challenges, as symbolic algorithms tend to employ complex irregular data and control structures. SymGridParII is a middleware for parallel symbolic computing on massively parallel High Performance Computing platforms. A key element of SymGridParII is a domain-specific language (DSL) called Haskell Distributed Parallel Haskell (HdpH). It is explicitly designed for scalable distributed-memory parallelism, and employs work stealing to load balance dynamically generated irregular task sizes. To investigate providing scalable fault tolerant symbolic computation, we design, implement and evaluate a reliable version of HdpH, HdpH-RS. Its reliable scheduler detects and handles faults, using task replication as a key recovery strategy. The scheduler supports load balancing with a fault tolerant work stealing protocol. The reliable scheduler is invoked with two fault tolerance primitives for implicit and explicit work placement, and 10 fault tolerant parallel skeletons that encapsulate common parallel programming patterns. The user is oblivious to many failures; they are instead handled by the scheduler. An operational semantics describes small-step reductions on states. A simple abstract machine for scheduling transitions and task evaluation is presented. It defines the semantics of supervised futures, and the transition rules for recovering tasks in the presence of failure. The transition rules are demonstrated with a fault-free execution, and three executions that recover from faults. The fault tolerant work stealing protocol has been abstracted into a Promela model, and the SPIN model checker is used to exhaustively search the state space of this model to validate a key resiliency property of the protocol: an initially empty supervised future on the supervisor node will eventually be full in the presence of all possible combinations of failures. The performance of HdpH-RS is measured using five benchmarks. Supervised scheduling achieves a speedup of 757 with explicit task placement and 340 with lazy work stealing when executing Summatory Liouville on up to 1400 cores of an HPC architecture. Moreover, supervision overheads are consistently low when scaling up to 1400 cores. Low recovery overheads are observed in the presence of frequent failure when lazy on-demand work stealing is used. A Chaos Monkey mechanism has been developed for stress testing resiliency with random failure combinations. All unit tests pass in the presence of random failure, terminating with the expected results.
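
    HdpH-RS itself is a Haskell DSL; purely as a language-agnostic illustration of its recovery rule (an empty supervised future whose executor has died is eventually refilled by replaying the task), here is a minimal Python sketch that replays a failed task. The thread pool, exceptions-as-node-failures, and retry count are stand-ins, not HdpH-RS machinery; replay is only safe because the task is idempotent, which is why replicating pure computations works.

```python
import concurrent.futures as cf

def supervised_spawn(pool, task, args, retries=3):
    """Fill a 'future' by running `task`; replay it if the worker fails.

    A hypothetical sketch of supervision, not HdpH-RS's implementation:
    threads stand in for nodes, and an exception stands in for the death
    of the node holding the task.
    """
    for attempt in range(retries + 1):
        future = pool.submit(task, *args)
        try:
            return future.result()   # future filled: supervision succeeds
        except Exception:
            if attempt == retries:
                raise                # no more replays: surface the fault

attempts = {"n": 0}

def flaky_square(x):
    attempts["n"] += 1
    if attempts["n"] < 3:            # the first two "nodes" die mid-task
        raise RuntimeError("worker died")
    return x * x

with cf.ThreadPoolExecutor() as pool:
    print(supervised_spawn(pool, flaky_square, (12,)))  # -> 144 on replay
```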

    Explicit Representation of Exception Handling in the Development of Dependable Component-Based Systems

    Exception handling is a structuring technique that facilitates the design of systems by encapsulating the process of error recovery. In this paper, we present a systematic approach for incorporating exceptional behaviour in the development of component-based software. The premise of our approach is that components alone do not provide the appropriate means to deal with exceptional behaviour in an effective manner. Hence the need to consider the notion of collaborations for capturing the interactive behaviour between components when error recovery involves more than one component. The feasibility of the approach is demonstrated through a case study of a mining control system.
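
    Since error recovery here spans several components, the idea is easiest to see in code. The sketch below is a hypothetical rendering (not the paper's notation) of a collaboration that, when one participant raises an exception, lets every participant run its own handler, so recovery is coordinated across the whole interaction; the pump and fan merely stand in for components of a mining control system.

```python
class Collaboration:
    """Coordinates exception handling across several components."""

    def __init__(self, *components):
        self.components = components

    def run(self):
        try:
            for c in self.components:
                c.act()
        except Exception as exc:
            # The exceptional behaviour is handled at the level of the
            # collaboration: every participant recovers cooperatively,
            # not just the component that failed.
            for c in self.components:
                c.handle(exc)

class Pump:
    def act(self):
        raise RuntimeError("pump jammed")   # simulated component failure
    def handle(self, exc):
        print("pump: switching to standby unit")

class Fan:
    def act(self):
        pass
    def handle(self, exc):
        print("fan: increasing ventilation while pump recovers")

Collaboration(Pump(), Fan()).run()
```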

    Fault-free performance validation of fault-tolerant multiprocessors

    A validation methodology for testing the performance of fault-tolerant computer systems was developed and applied to the Fault-Tolerant Multiprocessor (FTMP) at NASA-Langley's AIRLAB facility. This methodology was claimed to be general enough to apply to any ultrareliable computer system. The goal of this research was to extend the validation methodology and to demonstrate its robustness through more extensive application to NASA's Fault-Tolerant Multiprocessor System (FTMP) and to the Software Implemented Fault-Tolerance (SIFT) computer system. Furthermore, the performance of these two multiprocessors was compared by conducting similar experiments on both. An analysis of the results shows that high-level language instruction execution times for both SIFT and FTMP were consistent and predictable, with SIFT having greater throughput. At the operating system level, FTMP consumes 60% of its throughput on its real-time dispatcher and 5% on fault-handling tasks. In contrast, SIFT consumes only 16% of its throughput on the dispatcher, but 66% in fault-handling software overhead.

    Cryptographic approaches for confidential computations in blockchain

    Blockchain technologies have been widely researched in the last decade, mainly because of the revolution they propose for different use cases. Moving away from centralized solutions that abuse their capabilities, blockchain looks like a great solution for integrity, transparency, and decentralization. However, there are still some problems to be solved, lack of privacy being one of the main ones. In this paper, we focus on a subset of the privacy area, which is confidentiality. Although users are increasingly aware of the importance of confidentiality, blockchain poses a barrier to the confidential treatment of data. We initiate the study of cryptographic confidential computing tools and focus on how these technologies can endow the blockchain with better capabilities, i.e., enable rich and versatile applications while protecting users' data. We identify Zero Knowledge Proofs, Fully Homomorphic Encryption, and Secure Multiparty Computation as good candidates to achieve this.
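
    Of the three candidate tools, secure multiparty computation is the easiest to convey in a few lines. The following is a minimal additive secret-sharing sketch (a toy, not a production MPC protocol, and not tied to any particular blockchain) in which three parties compute a sum without any of them seeing the inputs in the clear.

```python
import secrets

P = 2**61 - 1  # a public prime modulus; all arithmetic is mod P

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it mod P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Each party holds one share of each input; adding shares pointwise
# yields shares of the sum, so the sum is computed without any single
# party learning the inputs.
a_shares = share(123, 3)
b_shares = share(456, 3)
sum_shares = [(x + y) % P for x, y in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 579
print(reconstruct(sum_shares))  # -> 579
```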