389 research outputs found

    Implementing fault tolerant applications using reflective object-oriented programming

    Abstract: Shows how reflection and object-oriented programming can be used to ease the implementation of classical fault tolerance mechanisms in distributed applications. When the underlying runtime system does not provide fault tolerance transparently, classical approaches to implementing fault tolerance mechanisms often imply mixing functional programming with non-functional programming (e.g. error processing mechanisms). The use of reflection improves the transparency of fault tolerance mechanisms to the programmer and, more generally, provides a clearer separation between functional and non-functional programming. The implementations of some classical replication techniques using a reflective approach are presented in detail and illustrated by several examples, which have been prototyped on a network of Unix workstations. Lessons learnt from our experiments are drawn and future work is discussed.

    Constructing fail-controlled nodes for distributed systems: a software approach

    PhD Thesis. Designing and implementing distributed systems which continue to provide specified services in the presence of processing site and communication failures is a difficult task. To facilitate their development, distributed systems have been built assuming that their underlying hardware components are fail-controlled, i.e. present a well-defined failure mode. However, if conventional hardware cannot provide the assumed failure mode, there is a need to build processing sites or nodes, and communication infrastructure, that present the fail-controlled behaviour assumed. Coupling a number of redundant processors within a replicated node is a well-known way of constructing fail-controlled nodes. Computation is replicated and executed simultaneously at each processor, and by applying suitable validation techniques to the outputs generated by processors (e.g. majority voting, comparison), outputs from faulty processors can be prevented from appearing at the application level. One way of constructing replicated nodes is by introducing hardwired mechanisms to couple replicated processors with specialised validation hardware circuits. Processors are tightly synchronised at the clock cycle level, and have their outputs validated by reliable validation hardware. Another approach is to use software mechanisms to perform synchronisation of processors and validation of the outputs. The main advantage of hardware-based nodes is the minimal performance overhead incurred. However, the introduction of special circuits may increase the complexity of the design tremendously. Further, every new microprocessor architecture requires considerable redesign overhead. Software-based nodes do not present these problems; on the other hand, they introduce much larger performance overheads to the system. In this thesis we investigate alternative ways of constructing efficient fail-controlled, software-based replicated nodes. In particular, we present much more efficient order protocols, which are necessary for the implementation of these nodes. Our protocols, unlike others published to date, do not require processors' physical clocks to be explicitly synchronised. The main contribution of this thesis is the precise definition of the semantics of a software-based fail-silent node, along with its efficient design, implementation and performance evaluation.
    The Brazilian National Research Council (CNPq/Brasil)
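
The output validation mentioned in this abstract (majority voting across replicated processors) can be illustrated with a minimal sketch. The function name `majority_vote` and the example values are hypothetical and are not taken from the thesis; the sketch only shows how a strict majority masks the output of a single faulty replica.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas.

    Raises ValueError when no value is backed by more than half of the
    replicas; a fail-controlled node would then stay silent rather than
    emit a possibly wrong output (assumption made for this sketch).
    """
    if not outputs:
        raise ValueError("no replica outputs to validate")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


# Three replicas, one of which is faulty: the faulty output is masked.
result = majority_vote([42, 42, 17])
```

With three replicas the vote tolerates one faulty output; with only two replicas, validation reduces to comparison and can detect a disagreement but not mask it.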

    Asynchronous Teams and Tasks in a Message Passing Environment

    As the discipline of scientific computing grows, so too does the "skills gap" between the increasingly complex scientific applications and the efficient algorithms required. Increasing demand for computational power on the march towards exascale requires innovative approaches. Closing the skills gap avoids the many pitfalls that lead to poor utilisation of resources and wasted investment. This thesis tackles two challenges: asynchronous algorithms for parallel computing and fault tolerance. First, I present a novel asynchronous task invocation methodology for Discontinuous Galerkin codes called enclave tasking. The approach modifies the parallel ordering of tasks, which allows efficient scaling on dynamic meshes up to 756 cores. It ensures high levels of concurrency and intermixes tasks of different computational properties. Critical tasks along domain boundaries are prioritised for an overlap of computation and communication. The second contribution is the teaMPI library, which forms teams of MPI processes that exchange consistency data through an asynchronous "heartbeat". In contrast to previous approaches, teaMPI operates fully asynchronously with reduced overhead. It is also capable of detecting individually slow or failing ranks and inconsistent data among replicas. Finally, I provide an outlook into how asynchronous teams using enclave tasking can be combined into an advanced team-based diffusive load balancing scheme. Both concepts are integrated into and contribute towards the ExaHyPE project, a next-generation code that solves hyperbolic equation systems on dynamically adaptive Cartesian grids.
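
The two checks this abstract attributes to the heartbeat (detecting slow or failing ranks, and detecting inconsistent data among replicas) can be sketched in plain Python. This is not the teaMPI API: the function names, the dictionary layout and the timeout value are assumptions made purely for illustration, and no MPI calls are involved.

```python
from collections import Counter


def find_slow_replicas(last_heartbeat, now, timeout):
    """Return ids of replicas whose last heartbeat is older than `timeout` seconds."""
    return sorted(rid for rid, t in last_heartbeat.items() if now - t > timeout)


def find_inconsistent_replicas(checksums):
    """Return ids of replicas whose checksum differs from the most common one."""
    reference, _ = Counter(checksums.values()).most_common(1)[0]
    return sorted(rid for rid, c in checksums.items() if c != reference)


# Hypothetical heartbeat timestamps (seconds) and data checksums per replica.
slow = find_slow_replicas({"a": 10.0, "b": 4.0, "c": 9.5}, now=10.0, timeout=5.0)
inconsistent = find_inconsistent_replicas({"a": "x1", "b": "x1", "c": "ff"})
```

Here replica `b` is reported as slow because its last heartbeat is 6 seconds old, and replica `c` is reported as inconsistent because its checksum disagrees with the other two.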

    Replication and fault-tolerance in real-time systems

    PhD Thesis. The increased availability of sophisticated computer hardware and the corresponding decrease in its cost has led to a widespread growth in the use of computer systems for real-time plant and process control applications. Such applications typically place very high demands upon computer control systems, and the development of appropriate control software for these application areas can present a number of problems not normally encountered in other applications. First of all, real-time applications must be correct in the time domain as well as the value domain: returning results which are not only correct but also delivered on time. Further, since the potential for catastrophic failures can be high in a process or plant control environment, many real-time applications also have to meet high reliability requirements. These requirements will typically be met by means of a combination of fault avoidance and fault tolerance techniques. This thesis is intended to address some of the problems encountered in the provision of fault tolerance in real-time application programs. Specifically, it considers the use of replication to ensure the availability of services in real-time systems. In a real-time environment, providing support for replicated services can introduce a number of problems. In particular, the scope for non-deterministic behaviour in real-time applications can be quite large, and this can lead to difficulties in maintaining consistent internal states across the members of a replica group. To tackle this problem, a model is proposed for fault-tolerant real-time objects which not only allows such objects to perform application-specific recovery operations and real-time processing activities such as event handling, but which also allows objects to be replicated. The architectural support required for such replicated objects is also discussed and, to conclude, the run-time overheads associated with the use of such replicated services are considered.
    The Science and Engineering Research Council

    Conception et implémentation de systèmes résilients par une approche à composants (Design and Implementation of Resilient Systems Using a Component-Based Approach)

    Evolution during service life is mandatory, particularly for long-lived systems. Dependable systems, which continuously deliver trustworthy services, must evolve to accommodate changes, e.g. new fault tolerance requirements or variations in available resources. The addition of this evolutionary dimension to dependability leads to the notion of resilient computing. Among the various aspects of resilience, we focus on adaptivity. Dependability relies on fault-tolerant computing at runtime, applications being augmented with fault tolerance mechanisms (FTMs). As such, on-line adaptation of FTMs is a key challenge towards resilience. In related work, on-line adaptation of FTMs is most often performed in a preprogrammed manner or consists of tuning some parameters. Besides, FTMs are replaced monolithically. All the envisaged FTMs must be known at design time and deployed from the beginning. However, changes occur along multiple dimensions, and developing a system for the worst-case scenario is impossible. According to runtime observations, new FTMs can be developed off-line but integrated on-line. We denote this ability as agile adaptation, as opposed to preprogrammed adaptation. In this thesis, we present an approach for developing flexible fault-tolerant systems in which FTMs can be adapted at runtime in an agile manner through fine-grained modifications that minimize the impact on the initial architecture. We first propose a classification of a set of existing FTMs based on criteria such as fault model, application characteristics and necessary resources. Next, we analyze these FTMs and extract a generic execution scheme which pinpoints the common parts and the variable features between them. Then, we demonstrate the use of state-of-the-art tools and concepts from the field of software engineering, such as component-based software engineering and reflective component-based middleware, for developing a library of fine-grained adaptive FTMs. We evaluate the agility of the approach and illustrate its usability through two examples of integration of the library: first, in a design-driven development process for applications in pervasive computing and, second, in a toolkit for developing applications for wireless sensor networks (WSNs).

    Reusable and Extensible Fault Tolerance for RESTful Applications

    Abstract: Despite the simplicity and scalability benefits of REST, rendering RESTful web applications fault-tolerant requires that the programmer write vast amounts of non-trivial, ad hoc code. Network volatility, HTTP server errors, and service outages all require custom fault-handling code, whose effective implementation requires considerable programming expertise and effort. To provide a systematic and principled approach to handling faults in RESTful applications, we present FT-REST, an architectural framework for specifying fault tolerance functionality declaratively and then translating these specifications into platform-specific code. FT-REST encapsulates fault tolerance strategies in XML-based specifications and compiles them to modules that reify the requisite fault tolerance. To validate our approach, we have applied FT-REST to enhance several realistic RESTful applications to withstand the faults described in their FT-REST specifications. As REST is said to apply verbs (HTTP commands) to nouns (URIs), FT-REST enhances this conceptual model with adverbs that render REST reliable via reusable and extensible fault tolerance. Keywords: fault tolerance, web services, REST, software reusability, software extensibility
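
The idea of stating a fault tolerance strategy declaratively and applying it separately from the application logic can be sketched as follows. This is not FT-REST and does not use its XML format: the dictionary `SPEC`, its keys, and the helper `with_retries` are hypothetical stand-ins chosen for a short, self-contained illustration of a retry strategy.

```python
import time


def with_retries(operation, spec, sleep=time.sleep):
    """Call `operation`, retrying on the exception types listed in `spec`.

    `spec` is a hypothetical declarative description of the strategy:
    how many attempts to make, which exceptions count as transient
    faults, and how long to wait between attempts.
    """
    attempts = spec["max_attempts"]
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except spec["retry_on"]:
            if attempt == attempts:
                raise  # strategy exhausted: surface the fault to the caller
            sleep(spec["delay_seconds"])


# Declarative strategy, kept apart from the code that issues the request.
SPEC = {
    "max_attempts": 3,
    "retry_on": (ConnectionError, TimeoutError),
    "delay_seconds": 0.0,
}
```

An application would pass its own request function as `operation`; changing the strategy then means editing `SPEC` rather than the request code.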

    MAFTIA Conceptual Model and Architecture

    This document builds on the work reported in MAFTIA deliverable D1. It contains a refinement of the MAFTIA conceptual model and a discussion of the MAFTIA architecture. It also introduces the work done in WP6 on verification and assessment of security properties, which is reported on in more detail in MAFTIA deliverable D

    Data Storage and Dissemination in Pervasive Edge Computing Environments

    Nowadays, smart mobile devices generate huge amounts of data in all sorts of gatherings. Much of that data has localized and ephemeral interest, but can be of great use if shared among co-located devices. However, mobile devices often experience poor connectivity, leading to availability issues if application storage and logic are fully delegated to a remote cloud infrastructure. In turn, the edge computing paradigm pushes computations and storage beyond the data center, closer to end-user devices where data is generated and consumed. This enables the execution of certain components of edge-enabled systems directly and cooperatively on edge devices. This thesis focuses on the design and evaluation of resilient and efficient data storage and dissemination solutions for pervasive edge computing environments, operating with or without access to the network infrastructure. In line with this dichotomy, our goal can be divided into two specific scenarios. The first one is related to the absence of network infrastructure and the provision of a transient data storage and dissemination system for networks of co-located mobile devices. The second one relates to the existence of network infrastructure access and the corresponding edge computing capabilities. First, the thesis presents time-aware reactive storage (TARS), a reactive data storage and dissemination model with intrinsic time-awareness that exploits synergies between the storage substrate and the publish/subscribe paradigm, and allows queries within a specific time scope. Next, it describes in more detail: i) Thyme, a data storage and dissemination system for wireless edge environments, implementing TARS; ii) Parsley, a flexible and resilient group-based distributed hash table with preemptive peer relocation and a dynamic data sharding mechanism; and iii) Thyme GardenBed, a framework for data storage and dissemination across multi-region edge networks that makes use of both device-to-device and edge interactions. The developed solutions present low overheads while providing adequate response times for interactive usage and low energy consumption, proving to be practical in a variety of situations. They also display good load balancing and fault tolerance properties.

    Flexible and transparent fault tolerance for distributed object-oriented applications

    This report describes an approach enabling automatic structural reconfigurations of distributed applications based on configuration management in order to compensate for node and network failures. The major goal of the approach is to maintain the relevant application functionality after failures automatically. This goal is achieved by a dedicated system model and by a decentralized reconfiguration algorithm based on it. The system model provides support for redundant application object storage and for application-level consistency based on distributed checkpoints. The reconfiguration algorithm detects failures, computes a compensating configuration, and realizes this new configuration. The report emphasizes flexibility, in the sense of adaptable levels of fault tolerance, as well as transparency, in the sense of fully automatic reaction to failures.