Volume 71, Number 21, March 21, 1952

Author: A.S. Tanenbaum
F. Schneider
G.V. Chockler
Publication venue: 'Atelier Fluxus Virus'
Publication date: 21/03/1952

14th Turkish Symposium on Artificial Intelligence and Neural Networks, June 16-17, 2005, Izmir, Turkey
WOS: 000239585200025

Replication of data or processes is an effective way to provide enhanced performance, high availability and fault tolerance in distributed systems. For instance, in systems based on the client-server model, a single server may serve many clients and, under heavy load, may be unable to respond to requests in time. In such a case, replicating data or servers may improve performance. Moreover, data and processes can be replicated to protect against failures. However, replication is a very complex procedure. In this paper, I propose a method for making systems fault tolerant through replication by exploiting collaborative agents. The method is also used to improve fault tolerance in multi-agent systems.

Izmir Inst Technol, EE & CE Depts, Turkish Sci & Res Council, Izmir Branch Chamber Elect & Elect Engineer

Crossref

Lawrence University

Ege University Institutional Repository

Using mobility and exception handling to achieve mobile agents that survive server crash failures

Author: Pears Simon
Publication date: 01/01/2005

Mobile agent technology, when designed and used effectively, can minimize bandwidth consumption and autonomously provide a snapshot of the current context of a distributed system. Protecting mobile agents from server crashes is a challenging issue, since developers normally have no control over remote servers. Server crash failures can leave replicas, in stable storage, unavailable for an unknown time period. Furthermore, few systems have considered the need for a fault-tolerant protocol among a group of collaborating mobile agents. This thesis uses exception handling to protect mobile agents from server crash failures. An exception model is proposed for mobile agents and two exception handler designs are investigated. The first exists at the server that created the mobile agent and uses a timeout mechanism. The second, the mobile shadow scheme, migrates with the mobile agent and operates at the previous server visited by the mobile agent. A case study application has been developed to compare the performance of the two exception handler designs. Performance results demonstrate that, although the second design is slower, it offers the smaller trip time when handling a server crash. Furthermore, no modification of the server environment is necessary. This thesis shows that the mobile shadow exception handling scheme reduces the complexity of enabling a group of mobile agents to survive server crashes. At each stage of the itinerary, the scheme deploys a replica that monitors the server occupied by the master. The replica resides at the previous server visited in the itinerary. Consequently, each group member is a single fault-tolerant entity with respect to server crash failures. Other schemes introduce greater complexity and performance overheads since, for each stage of the itinerary, a group of replicas is sent to servers that offer an equivalent service. In addition, directions for future research on fault tolerance in groups of collaborating mobile agents are identified.
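The mobile shadow idea in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's implementation: the names `Server`, `TripLog` and `run_itinerary` are hypothetical, a crash is modelled as a boolean flag rather than a network timeout, and the shadow's own host is assumed to stay up.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    crashed: bool = False

    def ping(self) -> bool:
        # Stand-in for a network probe with a timeout (assumption).
        return not self.crashed


@dataclass
class TripLog:
    visited: list = field(default_factory=list)
    # Pairs of (crashed server, server the shadow recovered from).
    skipped: list = field(default_factory=list)


def run_itinerary(home: Server, itinerary: list) -> TripLog:
    """Walk the itinerary with a shadow kept one hop behind the master."""
    log = TripLog()
    shadow_host = home  # before the first hop, the shadow sits at the creator
    for server in itinerary:
        # The shadow probes the server the master currently occupies.
        if server.ping():
            log.visited.append(server.name)
            shadow_host = server  # a fresh shadow is left at this stop
        else:
            # Crash detected: the shadow takes over from its own host
            # and continues with the next stop, skipping the crashed one.
            log.skipped.append((server.name, shadow_host.name))
    return log


if __name__ == "__main__":
    home = Server("home")
    stops = [Server("A"), Server("B", crashed=True), Server("C")]
    print(run_itinerary(home, stops))
```

In this toy run the agent visits A, loses its master at B, and the shadow left at A resumes the trip at C. Skipping the crashed server is one possible handler policy chosen for the sketch; the abstract does not say which recovery action the thesis takes.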

Durham e-Theses

OpenGrey Repository

Distributed Adaptive Fault-Tolerant Control of Uncertain Multi-Agent Systems

Author: Cao Yongcan
Khalili Mohsen
Parisini Thomas
Polycarpou Marios M.
Zhang Xiaodong
Publication date: 01/01/2015

This paper presents an adaptive fault-tolerant control (FTC) scheme for a class of nonlinear uncertain multi-agent systems. A local FTC scheme is designed for each agent using local measurements and suitable information exchanged between neighboring agents. Each local FTC scheme consists of a fault diagnosis module and a reconfigurable controller module. The controller module comprises a baseline controller and two adaptive fault-tolerant controllers, activated after fault detection and after fault isolation, respectively. Under certain assumptions, the closed-loop system's stability and leader-follower consensus properties are rigorously established under the different modes of the FTC system: the period before possible fault detection, the period between fault detection and possible isolation, and the period after fault isolation.
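The three operating modes described in this abstract amount to a small switching rule driven by the fault diagnosis module. The sketch below shows only that switching logic, with hypothetical names (`Mode`, `select_mode`); the control laws themselves are not given in the abstract and are not modelled here.

```python
from enum import Enum


class Mode(Enum):
    BASELINE = "baseline controller"
    POST_DETECTION = "adaptive controller used after fault detection"
    POST_ISOLATION = "adaptive controller used after fault isolation"


def select_mode(fault_detected: bool, fault_isolated: bool) -> Mode:
    """Pick the active controller from the diagnosis module's flags."""
    if fault_isolated and not fault_detected:
        # Isolation is assumed to follow detection, never precede it.
        raise ValueError("a fault cannot be isolated before it is detected")
    if fault_isolated:
        return Mode.POST_ISOLATION
    if fault_detected:
        return Mode.POST_DETECTION
    return Mode.BASELINE


if __name__ == "__main__":
    for flags in [(False, False), (True, False), (True, True)]:
        print(flags, "->", select_mode(*flags).value)
```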

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

An approach to rollback recovery of collaborating mobile agents

Author: Bargiela A
Osman T
Wagealla W
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 24/02/2004

Fault tolerance is one of the main problems that must be resolved to improve the adoption of the agent computing paradigm. In this paper, we analyse the execution model of agent platforms and the significance of the faults affecting their constituent components for the reliable execution of agent-based applications, in order to develop a pragmatic framework for fault tolerance in agent systems. The developed framework deploys a communication-pairs independent checkpointing strategy to offer a low-cost, application-transparent model for reliable agent-based computing. The model covers all possible faults that might invalidate reliable agent execution, migration and communication, and maintains the exactly-once execution property.

Crossref

Nottingham Trent Institutional Repository (IRep)

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Author: Armaroli Lorenzo
D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.

Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)
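One common way for a receiver to discard corrupted copies of a replicated message is strict majority voting. The abstract does not say which rule FT-GAIA applies, so the following is a generic sketch with a hypothetical function name (`accept_message`), not a description of the middleware.

```python
from collections import Counter


def accept_message(copies, replicas):
    """Return the payload sent by a strict majority of replicas, else None.

    copies   -- payloads actually received (crashed replicas send nothing)
    replicas -- number of replicas the sender entity was deployed with
    """
    if replicas < 1 or len(copies) > replicas:
        raise ValueError("inconsistent replica count")
    if not copies:
        return None
    payload, votes = Counter(copies).most_common(1)[0]
    # Accept only if more than half of all replicas agree on the payload.
    return payload if votes > replicas // 2 else None


if __name__ == "__main__":
    print(accept_message(["step=7", "step=7", "step=9"], replicas=3))
    print(accept_message(["step=7", "step=9"], replicas=3))
```

With three replicas this rule masks either one crashed replica or one corrupted copy. It returns `None` when no payload reaches a majority, leaving the decision on how to proceed to the caller.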

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Author: Alexandrov Vassil
McKee Gerard
Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more of the computing cores on which they execute fail. This imposes not only a maintenance cost on the job, but also the cost of the time taken to reinstate it and the risk of losing data and the execution completed before the failure. Approaches that can proactively detect computing core failures and relocate the affected core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology at both the job and core levels. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the proposed approaches are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which fault tolerance is studied, centralised and decentralised checkpointing approaches add on average 90% to the actual time for executing the job. In the same experiment, by contrast, the multi-agent approaches add only 10% to the overall execution time.

Comment: Computers in Biology and Medicine
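To make the reported overheads concrete, the arithmetic can be applied to a job length. The 10-hour baseline below is a hypothetical figure chosen only for illustration; the 90% and 10% overheads are the ones quoted in the abstract, and `total_time` is an illustrative helper name.

```python
def total_time(base_hours, overhead_percent):
    """Execution time once a percentage overhead is added to the base time."""
    return base_hours * (100 + overhead_percent) / 100


if __name__ == "__main__":
    base = 10  # hypothetical failure-free job length, in hours
    print("checkpointing:", total_time(base, 90), "hours")
    print("multi-agent:  ", total_time(base, 10), "hours")
```

For this assumed 10-hour job, checkpointing would bring the run to 19 hours and the multi-agent approaches to 11 hours.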

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

University of St. Andrews - Pure

St Andrews Research Repository