    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, and HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast number of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures, since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.
    Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)
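
    The Byzantine-filtering idea described above lends itself to a simple majority-vote sketch. The Python fragment below is our illustration, not FT-GAIA's actual API: it assumes each synchronization message arrives once per sender replica and keeps only the payload that a majority of copies agree on.

        from collections import Counter

        def deliver(copies):
            """copies: payloads of one logical message, one per sender replica.
            Returns the majority payload, or None to drop the message when no
            majority exists (i.e., too many copies were corrupted)."""
            payload, votes = Counter(copies).most_common(1)[0]
            return payload if votes > len(copies) // 2 else None

        # Example: with 3 replicas, a single corrupted copy is outvoted.
        assert deliver(["advance:42", "advance:42", "advance:99"]) == "advance:42"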

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in virtually any field of science or engineering. PADS takes advantage of multiple execution units running on multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.
    Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731
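
    Complementing the voting sketch above, here is a minimal illustration of the replication-and-distribution step: every logical entity gets R replicas placed on distinct nodes, so that up to R-1 node crashes leave at least one live replica. The round-robin placement rule is our assumption, not FT-GAIA's actual policy.

        def place_replicas(entities, nodes, r):
            """Map each entity to r distinct execution nodes (round-robin)."""
            placement = {}
            for i, entity in enumerate(entities):
                placement[entity] = [nodes[(i + k) % len(nodes)] for k in range(r)]
            return placement

        print(place_replicas(["e0", "e1", "e2"], ["n0", "n1", "n2", "n3"], r=2))
        # {'e0': ['n0', 'n1'], 'e1': ['n1', 'n2'], 'e2': ['n2', 'n3']}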

    Event-B Patterns for Specifying Fault-Tolerance in Multi-Agent Interaction

    Interaction in a multi-agent system is susceptible to failure. A rigorous development of a multi-agent system must include the treatment of fault tolerance in agent interactions, so that agents can continue to function independently when an interaction fails. Patterns can be used to capture fault-tolerance techniques. A set of modelling patterns is presented that specifies fault tolerance in Event-B specifications of multi-agent interactions. The purpose of these patterns is to capture common modelling structures for distributed agent interaction in a form that is reusable in other related developments. The patterns have been applied to a case study of the Contract Net interaction protocol.
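
    The paper's patterns are Event-B models; as a rough operational analogue, the Python sketch below shows the kind of behaviour such a pattern captures: a Contract Net manager that tolerates non-responding bidders through a timeout rather than blocking forever. All names and the timeout policy are illustrative assumptions.

        import queue, threading, time

        def run_auction(task, bidders, timeout=0.5):
            """Collect bids for `task`; bidders that crash or stall are ignored."""
            replies = queue.Queue()
            for bid_fn in bidders:
                threading.Thread(target=lambda f=bid_fn: replies.put(f(task)),
                                 daemon=True).start()
            bids, deadline = [], time.monotonic() + timeout
            while len(bids) < len(bidders):
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break  # timed out: proceed with whatever bids arrived
                try:
                    bids.append(replies.get(timeout=remaining))
                except queue.Empty:
                    break
            return min(bids) if bids else None  # award cheapest bid, or fail safely

        fast = lambda task: 10
        stalled = lambda task: time.sleep(5) or 99  # never answers in time
        print(run_auction("paint fence", [fast, stalled]))  # -> 10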

    HT-Paxos: High Throughput State-Machine Replication Protocol for Large Clustered Data Centers

    Paxos is a prominent family of state-machine replication protocols. Recent data-intensive systems that implement state machine replication generally require high throughput. Earlier variants of Paxos, such as Classic Paxos, Fast Paxos, and Generalized Paxos, focus mainly on fault tolerance and latency, but are lacking in terms of throughput and scalability. A major reason for this is the heavyweight leader. By offloading the leader, we can further increase the throughput of the system. Ring Paxos, Multi-Ring Paxos, and S-Paxos are prominent attempts in this direction for clustered data centers. In this paper, we propose HT-Paxos, a variant of Paxos that is well suited to any large clustered data center. HT-Paxos offloads the leader very significantly and hence increases the throughput and scalability of the system, while at the same time, among high-throughput state-machine replication protocols, providing reasonably low latency and response time.
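
    The leader-offloading idea can be illustrated with a toy sketch in the spirit of S-Paxos, which the abstract cites as prior work: replicas disseminate full request payloads among themselves, while the leader orders only small request IDs, so its bandwidth cost is independent of request size. The classes and methods below are assumptions for illustration, not HT-Paxos's actual design.

        class Replica:
            def __init__(self):
                self.payloads = {}   # request_id -> full request body

            def disseminate(self, request_id, body, peers):
                # heavy payload travels replica-to-replica, bypassing the leader
                for p in peers + [self]:
                    p.payloads[request_id] = body

        class Leader:
            def __init__(self):
                self.log = []        # ordered request IDs only (small messages)

            def order(self, request_id):
                self.log.append(request_id)
                return len(self.log) - 1   # slot number

        r1, r2, leader = Replica(), Replica(), Leader()
        r1.disseminate("req-7", b"x" * 10_000, peers=[r2])  # 10 KB never hits the leader
        slot = leader.order("req-7")                        # leader ships bytes, not KBs
        print(slot, len(r2.payloads["req-7"]))              # 0 10000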

    A Demand Based Load Balanced Service Replication Model

    Cloud computing allows service users and providers to access applications, logical resources, and files on any computer with ease. A cloud service has three distinct characteristics that differentiate it from traditional hosting: it is sold on demand, typically by the minute or the hour; it is elastic; and it allows capacity or capabilities to be added on the fly without investing in new infrastructure, training new personnel, or licensing new software. It not only promises reliable services delivered through next-generation data centers built on compute and storage virtualization technologies, but also addresses key issues such as scalability, reliability, fault tolerance, and file load balancing. One way to achieve this is through service replication across different machines, coupled with load balancing. Though replication potentially improves fault tolerance, it leads to the problem of ensuring consistency of replicas when a service is updated or modified; conversely, fewer replicas decrease concurrency and the level of service availability. A balanced synchronization between the replication mechanism and consistency not only ensures a highly reliable and fault-tolerant system but also improves system performance significantly. This paper presents a load-balancing-based service replication model that creates a replica on other servers on the basis of the number of service accesses. The simulation results indicate that the proposed model reduces the number of messages exchanged for service replication by 25-55%, thus improving overall system performance significantly. Also, in the case of CPU-load-based file replication, it is observed that file access time reduces by 5.56%-7.65%.
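
    As we read it, the model's core rule is: replicate a service once its access count crosses a demand threshold. The Python sketch below is a minimal rendering of that rule; the threshold value and the server-selection policy are our assumptions, not the paper's exact algorithm.

        class ReplicaManager:
            def __init__(self, threshold=100):
                self.threshold = threshold
                self.access_count = {}   # service -> accesses since last replication
                self.replicas = {}       # service -> servers hosting a replica

            def record_access(self, service, spare_servers):
                """Count one access; replicate when demand crosses the threshold."""
                self.access_count[service] = self.access_count.get(service, 0) + 1
                if self.access_count[service] >= self.threshold and spare_servers:
                    target = spare_servers.pop(0)   # e.g., the least-loaded server
                    self.replicas.setdefault(service, []).append(target)
                    self.access_count[service] = 0  # reset the demand counter
                    return target                   # new replica placed here
                return None

        mgr = ReplicaManager(threshold=3)
        servers = ["s2", "s3"]
        hits = [mgr.record_access("auth", servers) for _ in range(3)]
        print(hits)  # [None, None, 's2']: the third access triggers replication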

    Exploiting the Use of Cooperation in Self-Organizing Reliable Multiagent Systems

    In this paper, a novel cooperative approach is presented that introduces a self-organizing engine to achieve high reliability and availability in multiagent systems. The Adaptive Multiagent Systems theory is applied to design adaptive groups of agents in order to build reliable multiagent systems. According to this theory, adaptiveness is achieved via the cooperative behaviors of agents and their ability to change their communication links autonomously. In this approach, there is no centralized control mechanism in the multiagent system, and no global knowledge of the system is needed to achieve reliability. The approach was implemented to demonstrate its performance gain in a set of experiments performed under different operating conditions. The experimental results illustrate the effectiveness of this approach.
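
    A loose sketch of the autonomous link-rewiring behaviour described above: when an agent detects a failed neighbour, it drops that link and connects to the neighbour's live neighbours, using only local knowledge and no central controller. Every detail below is our illustrative assumption, not the paper's engine.

        class Agent:
            def __init__(self, name):
                self.name, self.links, self.alive = name, set(), True

            def connect(self, other):
                self.links.add(other); other.links.add(self)

            def heal(self):
                for n in [l for l in self.links if not l.alive]:
                    self.links.discard(n)
                    # adopt the failed neighbour's live links (local knowledge only)
                    for nn in n.links:
                        if nn.alive and nn is not self:
                            self.connect(nn)

        a, b, c = Agent("a"), Agent("b"), Agent("c")
        a.connect(b); b.connect(c)        # line topology: a - b - c
        b.alive = False
        a.heal(); c.heal()                # a and c rewire around the failed b
        print({l.name for l in a.links})  # {'c'}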

    The Impact of RDMA on Agreement

    Remote Direct Memory Access (RDMA) is becoming widely available in data centers. This technology allows a process to directly read and write the memory of a remote host, with a mechanism to control access permissions. In this paper, we study the fundamental power of these capabilities. We consider the well-known problem of achieving consensus despite failures, and find that RDMA can improve the inherent trade-off in distributed computing between failure resilience and performance. Specifically, we show that RDMA allows algorithms that simultaneously achieve high resilience and high performance, while traditional algorithms had to choose one or the other. With Byzantine failures, we give an algorithm that only requires $n \geq 2f_P + 1$ processes (where $f_P$ is the maximum number of faulty processes) and decides in two (network) delays in common executions. With crash failures, we give an algorithm that only requires $n \geq f_P + 1$ processes and also decides in two delays. Both algorithms tolerate a minority of memory failures inherent to RDMA, and they provide safety in asynchronous systems and liveness with standard additional assumptions.
    Comment: Full version of PODC'19 paper, strengthened broadcast algorithm
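
    The permission-control capability the abstract highlights is the key enabler. Below is a toy Python model, not an RDMA API: a memory region with dynamically assignable write permission, so a new leader can revoke a deposed leader's access and the memory itself rejects the old leader's late writes.

        class RdmaRegion:
            def __init__(self):
                self.value, self.writer = None, None

            def set_permission(self, process):      # grant exclusive write access
                self.writer = process

            def write(self, process, value):
                if process != self.writer:
                    return False                    # NIC rejects unauthorized write
                self.value = value
                return True

        mem = RdmaRegion()
        mem.set_permission("leader-1")
        assert mem.write("leader-1", "v1")
        mem.set_permission("leader-2")              # view change: revoke old leader
        assert not mem.write("leader-1", "v2")      # deposed leader's write bounces
        assert mem.value == "v1"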