Search CORE

389 research outputs found

Implementing fault tolerant applications using reflective object-oriented programming

Author: Fabre Jean-Charles
Nicomette Vincent
Pérennou Tanguy
Stroud Robert
Wu Zhixue
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/1995
Field of study

Abstract: Shows how reflection and object-oriented programming can be used to ease the implementation of classical fault tolerance mechanisms in distributed applications. When the underlying runtime system does not provide fault tolerance transparently, classical approaches to implementing fault tolerance mechanisms often imply mixing functional programming with non-functional programming (e.g. error processing mechanisms). The use of reflection improves the transparency of fault tolerance mechanisms to the programmer and more generally provides a clearer separation between functional and non-functional programming. The implementations of some classical replication techniques using a reflective approach are presented in detail and illustrated by several examples, which have been prototyped on a network of Unix workstations. Lessons learnt from our experiments are drawn and future work is discussed

Open Archive Toulouse Archive Ouverte

Constructing fail-controlled nodes for distributed systems: a software approach

Author: Brasileiro Francisco Vilar
Publication venue: Newcastle University
Publication date: 01/01/1995
Field of study

PhD ThesisDesigning and implementing distributed systems which continue to provide specified services in the presence of processing site and communication failures is a difficult task. To facilitate their development, distributed systems have been built assuming that their underlying hardware components are Jail-controlled, i.e. present a well defined failure mode. However, if conventional hardware cannot provide the assumed failure mode, there is a need to build processing sites or nodes, and communication infra-structure that present the fail-controlled behaviour assumed. Coupling a number of redundant processors within a replicated node is a well known way of constructing fail-controlled nodes. Computation is replicated and executed simultaneously at each processor, and by employing suitable validation techniques to the outputs generated by processors (e.g. majority voting, comparison), outputs from faulty processors can be prevented from appearing at the application level. One way of constructing replicated nodes is by introducing hardwired mechanisms to couple replicated processors with specialised validation hardware circuits. Processors are tightly synchronised at the clock cycle level, and have their outputs validated by a reliable validation hardware. Another approach is to use software mechanisms to perform synchronisation of processors and validation of the outputs. The main advantage of hardware based nodes is the minimum performance overhead incurred. However, the introduction of special circuits may increase the complexity of the design tremendously. Further, every new microprocessor architecture requires considerable redesign overhead. Software based nodes do not present these problems, on the other hand, they introduce much bigger performance overheads to the system. In this thesis we investigate alternative ways of constructing efficient fail-controlled, software based replicated nodes. In particular, we present much more efficient order protocols, which are necessary for the implementation of these nodes. Our protocols, unlike others published to date, do not require processors' physical clocks to be explicitly synchronised. The main contribution of this thesis is the precise definition of the semantics of a software based Jail-silent node, along with its efficient design, implementation and performance evaluation.The Brazilian National Research Council (CNPq/Brasil)

Newcastle University eTheses

Asynchronous Teams and Tasks in a Message Passing Environment

Author: HAZELWOOD BENJAMIN
Publication venue
Publication date: 01/01/2019
Field of study

As the discipline of scientific computing grows, so too does the "skills gap" between the increasingly complex scientific applications and the efficient algorithms required. Increasing demand for computational power on the march towards exascale requires innovative approaches. Closing the skills gap avoids the many pitfalls that lead to poor utilisation of resources and wasted investment. This thesis tackles two challenges: asynchronous algorithms for parallel computing and fault tolerance. First I present a novel asynchronous task invocation methodology for Discontinuous Galerkin codes called enclave tasking. The approach modifies the parallel ordering of tasks that allows for efficient scaling on dynamic meshes up to 756 cores. It ensures high levels of concurrency and intermixes tasks of different computational properties. Critical tasks along domain boundaries are prioritised for an overlap of computation and communication. The second contribution is the teaMPI library, forming teams of MPI processes exchanging consistency data through an asynchronous "heartbeat". In contrast to previous approaches, teaMPI operates fully asynchronously with reduced overhead. It is also capable of detecting individually slow or failing ranks and inconsistent data among replicas. Finally I provide an outlook into how asynchronous teams using enclave tasking can be combined into an advanced team-based diffusive load balancing scheme. Both concepts are integrated into and contribute towards the ExaHyPE project, a next generation code that solves hyperbolic equation systems on dynamically adaptive cartesian grids

Durham e-Theses

Replication and fault-tolerance in real-time systems

Author: Waterworth Adrian
Publication venue: Newcastle University
Publication date: 01/01/1992
Field of study

PhD ThesisThe increased availability of sophisticated computer hardware and the corresponding decrease in its cost has led to a widespread growth in the use of computer systems for realtime plant and process control applications. Such applications typically place very high demands upon computer control systems and the development of appropriate control software for these application areas can present a number of problems not normally encountered in other applications. First of all, real-time applications must be correct in the time domain as well as the value domain: returning results which are not only correct but also delivered on time. Further, since the potential for catastrophic failures can be high in a process or plant control environment, many real-time applications also have to meet high reliability requirements. These requirements will typically be met by means of a combination of fault avoidance and fault tolerance techniques. This thesis is intended to address some of the problems encountered in the provision of fault tolerance in real-time applications programs. Specifically,it considers the use of replication to ensure the availability of services in real-time systems. In a real-time environment, providing support for replicated services can introduce a number of problems. In particular, the scope for non-deterministic behaviour in real-time applications can be quite large and this can lead to difficultiesin maintainingconsistent internal states across the members of a replica group. To tackle this problem, a model is proposed for fault tolerant real-time objects which not only allows such objects to perform application specific recovery operations and real-time processing activities such as event handling, but which also allows objects to be replicated. The architectural support required for such replicated objects is also discussed and, to conclude, the run-time overheads associated with the use of such replicated services are considered.The Science and Engineering Research Council

Newcastle University eTheses

Conception et implémentation de systèmes résilients par une approche à composants

Author: Stoicescu Miruna
Publication venue: École Doctorale Systèmes (Toulouse);154236462
Publication date: 09/12/2013
Field of study

L'évolution des systèmes pendant leur vie opérationnelle est incontournable. Les systèmes sûrs de fonctionnement doivent évoluer pour s'adapter à des changements comme la confrontation à de nouveaux types de fautes ou la perte de ressources. L'ajout de cette dimension évolutive à la fiabilité conduit à la notion de résilience informatique. Parmi les différents aspects de la résilience, nous nous concentrons sur l'adaptativité. La sûreté de fonctionnement informatique est basée sur plusieurs moyens, dont la tolérance aux fautes à l'exécution, où l'on attache des mécanismes spécifiques (Fault Tolerance Mechanisms, FTMs) à l'application. A ce titre, l'adaptation des FTMs à l'exécution s'avère un défi pour développer des systèmes résilients. Dans la plupart des travaux de recherche existants, l'adaptation des FTMs à l'exécution est réalisée de manière préprogrammée ou se limite à faire varier quelques paramètres. Tous les FTMs envisageables doivent être connus dès le design du système et déployés et attachés à l'application dès le début. Pourtant, les changements ont des origines variées et, donc, vouloir équiper un système pour le pire scénario est impossible. Selon les observations pendant la vie opérationnelle, de nouveaux FTMs peuvent être développés hors-ligne, mais intégrés pendant l'exécution. On dénote cette capacité comme adaptation agile, par opposition à l'adaptation préprogrammée. Dans cette thèse, nous présentons une approche pour développer des systèmes sûrs de fonctionnement flexibles dont les FTMs peuvent s'adapter à l'exécution de manière agile par des modifications à grain fin pour minimiser l'impact sur l'architecture initiale. D'abord, nous proposons une classification d'un ensemble de FTMs existants basée sur des critères comme le modèle de faute, les caractéristiques de l'application et les ressources nécessaires. Ensuite, nous analysons ces FTMs et extrayons un schéma d'exécution générique identifiant leurs parties communes et leurs points de variabilité. Après, nous démontrons les bénéfices apportés par les outils et les concepts issus du domaine du génie logiciel, comme les intergiciels réflexifs à base de composants, pour développer une librairie de FTMs adaptatifs à grain fin. Nous évaluons l'agilité de l'approche et illustrons son utilité à travers deux exemples d'intégration : premièrement, dans un processus de développement dirigé par le design pour les systèmes ubiquitaires et, deuxièmement, dans un environnement pour le développement d'applications pour des réseaux de capteurs. ABSTRACT : Evolution during service life is mandatory, particularly for long-lived systems. Dependable systems, which continuously deliver trustworthy services, must evolve to accommodate changes e.g., new fault tolerance requirements or variations in available resources. The addition of this evolutionary dimension to dependability leads to the notion of resilient computing. Among the various aspects of resilience, we focus on adaptivity. Dependability relies on fault tolerant computing at runtime, applications being augmented with fault tolerance mechanisms (FTMs). As such, on-line adaptation of FTMs is a key challenge towards resilience. In related work, on-line adaption of FTMs is most often performed in a preprogrammed manner or consists in tuning some parameters. Besides, FTMs are replaced monolithically. All the envisaged FTMs must be known at design time and deployed from the beginning. However, dynamics occurs along multiple dimensions and developing a system for the worst-case scenario is impossible. According to runtime observations, new FTMs can be developed off-line but integrated on-line. We denote this ability as agile adaption, as opposed to the preprogrammed one. In this thesis, we present an approach for developing flexible fault-tolerant systems in which FTMs can be adapted at runtime in an agile manner through fine-grained modifications for minimizing impact on the initial architecture. We first propose a classification of a set of existing FTMs based on criteria such as fault model, application characteristics and necessary resources. Next, we analyze these FTMs and extract a generic execution scheme which pinpoints the common parts and the variable features between them. Then, we demonstrate the use of state-of-the-art tools and concepts from the field of software engineering, such as component-based software engineering and reflective component-based middleware, for developing a library of fine-grained adaptive FTMs. We evaluate the agility of the approach and illustrate its usability throughout two examples of integration of the library: first, in a design-driven development process for applications in pervasive computing and, second, in a toolkit for developing applications for WSNs

Thèses en Ligne

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

Institut National Polytechnique de Toulouse (Theses)

HAL-INSA Toulouse

Theses.fr

Reusable and Extensible Fault Tolerance for RESTful Applications

Author: Eli Tilevich
John Edstrom
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012
Field of study

Abstract—Despite the simplicity and scalability benefits of REST, rendering RESTful web applications fault-tolerant requires that the programmer write vast amounts of non-trivial, ad-hoc code. Network volatility, HTTP server errors, service outages—all require custom fault handling code, whose effective implementation requires considerable programming expertise and effort. To provide a systematic and principled ap-proach to handling faults in RESTful applications, we present FT-REST—an architectural framework for specifying fault tolerance functionality declaratively and then translating these specifications into platform-specific code. FT-REST encapsu-lates fault tolerance strategies in XML-based specifications and compiles them to modules that reify the requisite fault tolerance. To validate our approach, we have applied FT-REST to enhance several realistic RESTful applications to withstand the faults described in their FT-REST specifications. As REST is said to apply verbs (HTTP commands) to nouns (URIs), FT-REST enhances this conceptual model with adverbs that render REST reliable via reusable and extensible fault tolerance. Keywords-fault tolerance, web services, REST, software reusability, software extensibilit

CiteSeerX

Crossref

MAFTIA Conceptual Model and Architecture

Author: Adelsbach A.
Cachin C.
Creese S.
Deswarte Y.
Kursawe K.
Laprie J.-C.
Pfitzmann B.
Powell David
Randell B.
Riodan J.
Stroud Robert J.
Veríssimo Paulo
Waidner M.
Welch I. S.
Publication venue: Department of Informatics, University of Lisbon
Publication date: 01/11/2001
Field of study

This document builds on the work reported in MAFTIA deliverable D1. It contains a refinement of the MAFTIA conceptual model and a discussion of the MAFTIA architecture. It also introduces the work done in WP6 on verification and assessment of security properties, which is reported on in more detail in MAFTIA deliverable D

Universidade de Lisboa: Repositório.UL

Data Storage and Dissemination in Pervasive Edge Computing Environments

Author: Silva João André Almeida e
Publication venue
Publication date: 01/05/2021
Field of study

Nowadays, smart mobile devices generate huge amounts of data in all sorts of gatherings. Much of that data has localized and ephemeral interest, but can be of great use if shared among co-located devices. However, mobile devices often experience poor connectivity, leading to availability issues if application storage and logic are fully delegated to a remote cloud infrastructure. In turn, the edge computing paradigm pushes computations and storage beyond the data center, closer to end-user devices where data is generated and consumed. Hence, enabling the execution of certain components of edge-enabled systems directly and cooperatively on edge devices. This thesis focuses on the design and evaluation of resilient and efficient data storage and dissemination solutions for pervasive edge computing environments, operating with or without access to the network infrastructure. In line with this dichotomy, our goal can be divided into two specific scenarios. The first one is related to the absence of network infrastructure and the provision of a transient data storage and dissemination system for networks of co-located mobile devices. The second one relates with the existence of network infrastructure access and the corresponding edge computing capabilities. First, the thesis presents time-aware reactive storage (TARS), a reactive data storage and dissemination model with intrinsic time-awareness, that exploits synergies between the storage substrate and the publish/subscribe paradigm, and allows queries within a specific time scope. Next, it describes in more detail: i) Thyme, a data storage and dis- semination system for wireless edge environments, implementing TARS; ii) Parsley, a flexible and resilient group-based distributed hash table with preemptive peer relocation and a dynamic data sharding mechanism; and iii) Thyme GardenBed, a framework for data storage and dissemination across multi-region edge networks, that makes use of both device-to-device and edge interactions. The developed solutions present low overheads, while providing adequate response times for interactive usage and low energy consumption, proving to be practical in a variety of situations. They also display good load balancing and fault tolerance properties.Resumo Hoje em dia, os dispositivos móveis inteligentes geram grandes quantidades de dados em todos os tipos de aglomerações de pessoas. Muitos desses dados têm interesse loca- lizado e efêmero, mas podem ser de grande utilidade se partilhados entre dispositivos co-localizados. No entanto, os dispositivos móveis muitas vezes experienciam fraca co- nectividade, levando a problemas de disponibilidade se o armazenamento e a lógica das aplicações forem totalmente delegados numa infraestrutura remota na nuvem. Por sua vez, o paradigma de computação na periferia da rede leva as computações e o armazena- mento para além dos centros de dados, para mais perto dos dispositivos dos utilizadores finais onde os dados são gerados e consumidos. Assim, permitindo a execução de certos componentes de sistemas direta e cooperativamente em dispositivos na periferia da rede. Esta tese foca-se no desenho e avaliação de soluções resilientes e eficientes para arma- zenamento e disseminação de dados em ambientes pervasivos de computação na periferia da rede, operando com ou sem acesso à infraestrutura de rede. Em linha com esta dico- tomia, o nosso objetivo pode ser dividido em dois cenários específicos. O primeiro está relacionado com a ausência de infraestrutura de rede e o fornecimento de um sistema efêmero de armazenamento e disseminação de dados para redes de dispositivos móveis co-localizados. O segundo diz respeito à existência de acesso à infraestrutura de rede e aos recursos de computação na periferia da rede correspondentes. Primeiramente, a tese apresenta armazenamento reativo ciente do tempo (ARCT), um modelo reativo de armazenamento e disseminação de dados com percepção intrínseca do tempo, que explora sinergias entre o substrato de armazenamento e o paradigma pu- blicação/subscrição, e permite consultas num escopo de tempo específico. De seguida, descreve em mais detalhe: i) Thyme, um sistema de armazenamento e disseminação de dados para ambientes sem fios na periferia da rede, que implementa ARCT; ii) Pars- ley, uma tabela de dispersão distribuída flexível e resiliente baseada em grupos, com realocação preventiva de nós e um mecanismo de particionamento dinâmico de dados; e iii) Thyme GardenBed, um sistema para armazenamento e disseminação de dados em redes multi-regionais na periferia da rede, que faz uso de interações entre dispositivos e com a periferia da rede. As soluções desenvolvidas apresentam baixos custos, proporcionando tempos de res- posta adequados para uso interativo e baixo consumo de energia, demonstrando serem práticas nas mais diversas situações. Estas soluções também exibem boas propriedades de balanceamento de carga e tolerância a faltas

Repositório da Universidade Nova de Lisboa

Flexible and transparent fault tolerance for distributed object-oriented applications

Author: Kottmann Dietmar
Schill Alexander B.
Publication venue
Publication date: 02/08/2007
Field of study

This report describes an approach enabling automatic structural reconfigurations of distributed applications based on configuration management in order to compensate for node and network failures. The major goal of the approach is to maintain the relevant application functionality after failures automatically.This goalis achieved by a dedicated system model and by a decentralized reconfiguration algorithm based on it. The system model provides support for redundant application object storage and for application-level consistency based on distributed checkpoints. The reconfiguration algorithm detects failures, computes a compensating configuration, and realizes this new configuration. The report emphasizes flexibility in the sense ofadaptable levels of fault tolerance, as well as transparency in the sense of fully-automatic reaction to failures

KITopen