12,656 research outputs found

    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages

    State-preserving container orchestration in failover scenarios

    Containers have been widely adopted for deployment of high availability applications and services. This adoption is in part due to the native support of fault tolerance mechanisms in container orchestration frameworks such as Kubernetes. While Kubernetes provides service replication as a fault tolerance mechanism for stateless applications, service replication does not satisfy requirements for stateful applications. Currently this shortcoming is addressed by data replication in databases. This requires a tight coupling and modification of the stateful application to support high availability. Thus, this thesis proposes a new Checkpoint/Restore (C/R) Kubernetes operator to achieve fault tolerance for stateful applications without any modification of the application. The operator takes a checkpoint in a configurable interval. In case of a fault a new application container is created automatically from the most recent checkpoint. We compare the proposed approach with a more conventional approach in which we pull and restore the application state from the application through an API. We measure the overhead of both methods, the service interruption and the recovery time in case of faults. We find the C/R Operator has similar performance in recovery time as the traditional approach, but does not need any application modification. The results signify C/R as a promising technology for a fault tolerance mechanism for stateful applications
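    The checkpoint-at-interval and restore-from-latest pattern described in this abstract can be sketched in a few lines. This is only an illustrative sketch, not the thesis's Kubernetes operator: `CheckpointStore`, `CounterApp`, `run_with_checkpoints`, and `recover` are hypothetical names, checkpoints are held in memory rather than on durable storage, and time is modeled as logical ticks.

```python
import copy


class CheckpointStore:
    """Holds checkpoints in memory (assumption: a real system would persist them)."""

    def __init__(self):
        self._checkpoints = []  # list of (tick, state snapshot)

    def save(self, tick, state):
        # Deep-copy so later mutations of the live state do not alter the snapshot.
        self._checkpoints.append((tick, copy.deepcopy(state)))

    def latest(self):
        return self._checkpoints[-1] if self._checkpoints else None


class CounterApp:
    """Hypothetical stateful application that knows nothing about checkpointing."""

    def __init__(self, state=None):
        self.state = state if state is not None else {"count": 0}

    def handle_request(self):
        self.state["count"] += 1


def run_with_checkpoints(app, store, ticks, interval):
    """Drive the app for `ticks` logical steps, checkpointing every `interval` steps."""
    for tick in range(1, ticks + 1):
        app.handle_request()
        if tick % interval == 0:
            store.save(tick, app.state)


def recover(store):
    """Build a fresh app instance from the most recent checkpoint, if one exists."""
    latest = store.latest()
    if latest is None:
        return 0, CounterApp()
    tick, state = latest
    return tick, CounterApp(copy.deepcopy(state))
```

    In this sketch, work done after the last checkpoint is lost on recovery, so the interval bounds how much state can be lost.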

    Robust data storage in a network of computer systems

    PhD Thesis. Robustness of data in this thesis is taken to mean reliable storage of data and also high availability of data objects in spite of the occurrence of faults. Algorithms and data structures which can be used to provide such robustness in the presence of various disk, processor and communication network failures are described. Reliable storage of data at individual nodes in a network of computer systems is based on the use of a stable storage mechanism combined with strategies which are used to help ensure crash resistance of file operations in spite of the use of buffering mechanisms by operating systems. High availability of data in the network is maintained by replicating data on different computers and mutual consistency between replicas is ensured in spite of network partitioning. A stable storage system which provides atomicity for more complex data structures instead of the usual fixed size page has been designed and implemented and its performance evaluated. A crash resistant file system has also been implemented and evaluated. Many of the techniques presented here are used in the design of what we call CRES (Crash-resistant, Replicated and Stable) storage. CRES storage provides fault tolerance facilities for various disk and processor faults. It also provides fault tolerance facilities for network partitioning through the provision of an algorithm for the update and merge of a partitioned data storage system
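    One common way to make a file update crash resistant despite operating-system buffering is to write a temporary file, force it to disk, and then atomically rename it over the target. The sketch below shows that general pattern only; it is an assumption that it resembles the thesis's approach, and `atomic_write` is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace the file at `path` with `data` so readers see old or new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            # Push the bytes past the OS buffer cache before the rename.
            os.fsync(tmp.fileno())
        # os.replace atomically swaps the new file into place.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

    A stricter version would also fsync the containing directory so the rename itself survives a crash; that step is omitted here for brevity.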

    Self-management for large-scale distributed systems

    Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. 
Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using a state-space model that makes it possible to trade off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control
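The idea behind a majority-based store can be illustrated with versioned reads and writes that each touch a majority of replicas: because any two majorities overlap, a read always reaches at least one replica holding the latest write. This is a minimal single-process sketch under stated assumptions, not the thesis's algorithm: `Replica` and `MajorityStore` are hypothetical names, and concurrent writers, networking, and the multiple consistency levels are left out.

```python
class Replica:
    """One copy of the data; `alive` simulates node failure or churn."""

    def __init__(self):
        self.data = {}  # key -> (version, value)
        self.alive = True


class MajorityStore:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum = n // 2 + 1  # smallest group that is a strict majority

    def _majority(self):
        live = [r for r in self.replicas if r.alive]
        if len(live) < self.quorum:
            raise RuntimeError("no majority of replicas available")
        return live[: self.quorum]

    def put(self, key, value):
        targets = self._majority()
        # Choose a version above anything the contacted majority has seen.
        version = max(r.data.get(key, (0, None))[0] for r in targets) + 1
        for r in targets:
            r.data[key] = (version, value)

    def get(self, key):
        targets = self._majority()
        # Return the value carrying the highest version among the majority.
        _, value = max(
            (r.data.get(key, (0, None)) for r in targets),
            key=lambda pair: pair[0],
        )
        return value
```

No replica plays the role of master here, so the loss of any minority of nodes leaves the store readable and writable.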

    A multiscale analysis of gene flow for the New England cottontail, an imperiled habitat specialist in a fragmented landscape

    Landscape features of anthropogenic or natural origin can influence organisms' dispersal patterns and the connectivity of populations. Understanding these relationships is of broad interest in ecology and evolutionary biology and provides key insights for habitat conservation planning at the landscape scale. This knowledge is germane to restoration efforts for the New England cottontail (Sylvilagus transitionalis), an early successional habitat specialist of conservation concern. We evaluated local population structure and measures of genetic diversity of a geographically isolated population of cottontails in the northeastern United States. We also conducted a multiscale landscape genetic analysis, in which we assessed genetic discontinuities relative to the landscape and developed several resistance models to test hypotheses about landscape features that promote or inhibit cottontail dispersal within and across the local populations. Bayesian clustering identified four genetically distinct populations, with very little migration among them, and additional substructure within one of those populations. These populations had private alleles, low genetic diversity, critically low effective population sizes (3.2-36.7), and evidence of recent genetic bottlenecks. Major highways and a river were found to limit cottontail dispersal and to separate populations. The habitat along roadsides, railroad beds, and utility corridors, on the other hand, was found to facilitate cottontail movement among patches. The relative importance of dispersal barriers and facilitators on gene flow varied among populations in relation to landscape composition, demonstrating the complexity and context dependency of factors influencing gene flow and highlighting the importance of replication and scale in landscape genetic studies. 
Our findings provide information for the design of restoration landscapes for the New England cottontail and also highlight the dual influence of roads, as both barriers and facilitators of dispersal for an early successional habitat specialist in a fragmented landscape

    Security Framework for Decentralized Shared Calendars

    We propose a security framework for Decentralized Shared Calendar (DeSCal). The proposed security framework provides confidentiality to replicated shared calendar events and secures the communication between users. It is designed in such a way that DeSCal preserves all of its characteristic features like fault-tolerance, crash recovery, availability and dynamic access control. It has been implemented on iPhone OS. We propose a security protocol for shared calendars whose data management is fully decentralized. In this protocol, we ensure both (i) the confidentiality of the replicated content and (ii) the security of communication between users. Because we use full data replication, our protocol preserves all the characteristics of such replication, namely fault tolerance and crash recovery. To validate our solution, we implemented a prototype on mobile devices running iPhone OS

    Trust in management: The effect of managerial trustworthy behavior and reciprocity

    In this paper we study the antecedents of subordinates’ trust in their leaders (STL). In particular, we focus on the effects of managerial trustworthy behavior (MTB) and subordinates’ perceptions of leaders’ trust in them (LTS). We develop a scale of managerial trustworthy behavior following the typology proposed by Whitener, Brodt, Korsgaard and Werner (1998) that includes: behavioral consistency, behavioral integrity, sharing and delegation of control, communication, and demonstration of concern. A sample of 109 Spanish middle managers provided data for our study. The results of the hierarchical regression analysis show that both MTB and LTS have a significant relationship with STL. Further, we study the effect of reciprocity in the trusting relationship. We find that there are significant differences between subordinates’ trust in management and their perceptions about superiors’ trust in them. Keywords: trust; leadership; reciprocity; social exchange

    Fault tolerant software technology for distributed computing system

    Issued as Monthly reports [nos. 1-23], Interim technical report, Technical guide books [nos. 1-2], and Final report, Project no. G-36-64