630 research outputs found

    Space Reclamation for Uncoordinated Checkpointing in Message-Passing Systems

    Get PDF
    Checkpointing and rollback recovery are techniques that can provide efficient recovery from transient process failures. In a message-passing system, the rollback of a message sender may cause the rollback of the corresponding receiver, and the system needs to roll back to a consistent set of checkpoints called recovery line. If the processes are allowed to take uncoordinated checkpoints, the above rollback propagation may result in the domino effect which prevents recovery line progression. Traditionally, only obsolete checkpoints before the global recovery line can be discarded, and the necessary and sufficient condition for identifying all garbage checkpoints has remained an open problem. A necessary and sufficient condition for achieving optimal garbage collection is derived and it is proved that the number of useful checkpoints is bounded by N(N+1)/2, where N is the number of processes. The approach is based on the maximum-sized antichain model of consistent global checkpoints and the technique of recovery line transformation and decomposition. It is also shown that, for systems requiring message logging to record in-transit messages, the same approach can be used to achieve optimal message log reclamation. As a final topic, a unifying framework is described by considering checkpoint coordination and exploiting piecewise determinism as mechanisms for bounding rollback propagation, and the applicability of the optimal garbage collection algorithm to domino-free recovery protocols is demonstrated

    RADIC II : a fault tolerant architecture with flexible dynamic redundancy

    Get PDF
    The demand for computational power has been leading the improvement of the High Performance Computing (HPC) area, generally represented by the use of distributed systems like clusters of computers running parallel applications. In this area, fault tolerance plays an important role in order to provide high availability isolating the application from the faults effects. Performance and availability form an undissociable binomial for some kind of applications. Therefore, the fault tolerant solutions must take into consideration these two constraints when it has been designed. In this dissertation, we present a few side-effects that some fault tolerant solutions may presents when recovering a failed process. These effects may causes degradation of the system, affecting mainly the overall performance and availability. We introduce RADIC-II, a fault tolerant architecture for message passing based on RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers) architecture. RADIC-II keeps as maximum as possible the RADIC features of transparency, decentralization, flexibility and scalability, incorporating a flexible dynamic redundancy feature, allowing to mitigate or to avoid some recovery side-effects.La demanda de computadores más veloces ha provocado el incremento del área de computación de altas prestaciones, generalmente representado por el uso de sistemas distribuidos como los clusters de computadores ejecutando aplicaciones paralelas. En esta área, la tolerancia a fallos juega un papel muy importante a la hora de proveer alta disponibilidad, aislando los efectos causados por los fallos. Prestaciones y disponibilidad componen un binomio indisociable para algunos tipos de aplicaciones. Por eso, las soluciones de tolerancia a fallos deben tener en consideración estas dos restricciones desde el momento de su diseño. En esta disertación, presentamos algunos efectos colaterales que se puede presentar en ciertas soluciones tolerantes a fallos cuando recuperan un proceso fallado. Estos efectos pueden causar una degradación del sistema, afectando las prestaciones y disponibilidad finales. Presentamos RADIC-II, una arquitectura tolerante a fallos para paso de mensajes basada en la arquitectura RADIC (Redundant Array of Distributed Independent Fault Tolerance Controllers). RADIC-II mantiene al máximo posible las características de transparencia, descentralización, flexibilidad y escalabilidad existentes en RADIC, e incorpora una flexible funcionalidad de redundancia dinámica, que permite mitigar o evitar algunos efectos colaterales en la recuperación

    Using Permuted States of Validated Simulation to Analyze Conflict Rates in Optimistic Replication

    Get PDF
    Optimistic replication provides high data availability in the presence of network outages. Although widely deployed, this relaxed consistency model introduces concurrent updates, whose behavior is poorly understood due to the vast state space. This paper introduces the notion of permuted states to eliminate system states that are redundant and unreachable, which can constitute the majority of states (4069 out of 4096 for four replicas). With the aid of permuted states, we are for the first time able to construct analytical models beyond the two-replica case. By examining the analysis for 2 to 4 replicas, we can demystify the process of forming identical conflicts—the most common conflict type at high replication factors. Additionally, we have automated and optimized the generation of permuted states, which allows us to explore higher replication factors (up to 10 replicas) using hybrid techniques. It also allows us to validate our results with existing simulations based on actual replication mechanisms, which previously were analytically validated with only one pair of replicas. Finally, we have discovered that update locality and bimodal access patterns are the primary factors contributing to the formation of identical conflicts

    Contention management for distributed data replication

    Get PDF
    PhD ThesisOptimistic replication schemes provide distributed applications with access to shared data at lower latencies and greater availability. This is achieved by allowing clients to replicate shared data and execute actions locally. A consequence of this scheme raises issues regarding shared data consistency. Sometimes an action executed by a client may result in shared data that may conflict and, as a consequence, may conflict with subsequent actions that are caused by the conflicting action. This requires a client to rollback to the action that caused the conflicting data, and to execute some exception handling. This can be achieved by relying on the application layer to either ignore or handle shared data inconsistencies when they are discovered during the reconciliation phase of an optimistic protocol. Inconsistency of shared data has an impact on the causality relationship across client actions. In protocol design, it is desirable to preserve the property of causality between different actions occurring across a distributed application. Without application level knowledge, we assume an action causes all the subsequent actions at the same client. With application knowledge, we can significantly ease the protocol burden of provisioning causal ordering, as we can identify which actions do not cause other actions (even if they precede them). This, in turn, makes possible the client’s ability to rollback to past actions and to change them, without having to alter subsequent actions. Unfortunately, increased instances of application level causal relations between actions lead to a significant overhead in protocol. Therefore, minimizing the rollback associated with conflicting actions, while preserving causality, is seen as desirable for lower exception handling in the application layer. In this thesis, we present a framework that utilizes causality to create a scheduler that can inform a contention management scheme to reduce the rollback associated with the conflicting access of shared data. Our framework uses a backoff contention management scheme to provide causality preserving for those optimistic replication systems with high causality requirements, without the need for application layer knowledge. We present experiments which demonstrate that our framework reduces clients’ rollback and, more importantly, that the overall throughput of the system is improved when the contention management is used with applications that require causality to be preserved across all actions

    Lazy Checkpoint Coordination for Bounding Rollback Propagation

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryNational Aeronautics and Space Administration / NASA NAG 1-613Department of the Navy managed by the Office of the Chief of Naval Research / N00014-91-J-128
    • …
    corecore