4,016 research outputs found

    Fault-Tolerant Adaptive Parallel and Distributed Simulation

    Full text link
    Discrete Event Simulation is a widely used technique that is used to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributes Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, cluster of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016

    Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

    Full text link
    This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

    Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

    Get PDF
    Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.Comment: Computers in Biology and Medicin

    Exploiting the Use of Cooperation in Self-Organizing Reliable Multiagent Systems

    Get PDF
    In this paper, a novel and cooperative approach is exploited introducing a self-organizing engine to achieve high reliability and availability in multiagent systems. The Adaptive Multiagent Systems theory is applied to design adaptive groups of agents in order to build reliable multiagent systems. According to this theory, adaptiveness is achieved via the cooperative behaviors of agents and their ability to change the communication links autonomously. In this approach, there is not a centralized control mechanism in the multiagent system and there is no need of global knowledge of the system to achieve reliability. This approach was implemented to demonstrate its performance gain in a set of experiments performed under different operating conditions. The experimental results illustrate the effectiveness of this approach

    FATMAS: a methodology to design fault-tolerant multi-agent systems

    Get PDF
    Un système multi-agent (SMA) est un système dans lequel plusieurs agents opèrent et interagissent. Chaque agent a la responsabilité d’exécuter des tâches. Cependant, chaque agent, pour diverses raisons, peut rencontrer des problèmes pendant l’exécution de ses tâches ; ce qui peut induire un disfonctionnement du SMA. Cependant, le SMA doit être en mesure de détecter les sources de problèms (d’erreurs) afin de les contrôler et ainsi continuer son exécution correctement. Un tel SMA est appelé un SMA tolérant aux fautes. Il existe deux types de sources d’erreurs pour un agent : les erreurs causées par son environnment et les erreurs dûes à sa programmation. Dans la littérature, il existe plusieurs techniques qui traitent des erreurs de programmation au niveau des agents. Cependant, ces techniques ne traitent pas des erreurs causées par l’environnement de l’agent. Tout d’abord, nous distinguons entre l’environnment d’un agent et l’environnement du SMA. L’environnement d’un agent représente toutes les composantes matérielles ou logicielles que l’agent ne peut contrôler mais avec lesquelles il interagit. Cependant, l’environnment du SMA représente toutes les composantes que le système ne contrôle pas mais avec lesquelles il interagit. Ainsi, le SMA peut contrôler certaines des composantes avec lesquelles un agent interagit. Ainsi, une composante peut appartenir à l’environnement d’un agent et ne pas appartenir à l’environnement du système. Dans ce travail, nous présentons une méthodologie de conception de SMA tolérants aux fautes, nommée FATMAS, qui permet au concepteur du SMA de détecter et de corriger, si possible, les erreurs causées par les environnements des agents. Cette méthodologie permettra ainsi de délimiter la frontière du SMA de son environnement avec lequel il interagit. La frontière du SMA est déterminée par les différentes composantes (matérielles ou logicielles) que le système contrôle. Ainsi, le SMA, à l’intérieur de sa frontière, peut corriger les erreurs provenant de ses composantes. Cependant, le SMA n’a aucun contrôle sur toutes les composantes opérant dans son environnement. La méthodologie, que nous proposons, doit couvrir les trois premières phases d’un développement logiciel qui sont l’analyse, la conception et l’implémentation tout en intégrant, dans son processus de développement, une technique permettant au concepteur du système de délimiter la frontière du SMA et ainsi détecter les sources d’erreurs et les contrôler afin que le système multi-agent soit tolérant aux fautes (SMATF). Cependant, les méthodologies de conception de SMA, référencées dans la littérature, n’intègrent pas une telle technique. FATMAS offre au concepteur du SMATF quatre modèles pour décrire et développer le SMA ainsi qu’une technique de réorganisation du système qui lui permet de détecter et de contrôler ses sources d’erreurs, et ainsi définir la frontière du SMA. Chaque modèle est associé à un micro processus qui guide le concepteur lors du développement du modèle. FATMAS offre aussi un macro-processus, qui définit le cycle de développement de la méthodologie. FATMAS se base sur un développement itératif pour identifier et déterminer les tâches à ajouter au système afin de contrôler des sources d’erreurs. À chaque itération, le concepteur évalue, selon une fonction de coût/bénéfice s’il est opportun d’ajouter de nouvelles tâches de contrôle au système. Le premier modèle est le modèle de tâches-environnement. Il est développé lors de la phase d’analyse. Il identifie les différentes tâches que les agents doivent exécuter, leurs préconditions et leurs ressources. Ce modèle permet d’identifier différentes sources de problèmes qui peuvent causer un disfonctionnement du système. Le deuxième modèle est le modèle d’agents. Il est développé lors de la phase de conception. Il décrit les agents, leurs relations, et spécifie pour chaque agent les ressources auxquelles il a le droit d’accéder. Chaque agent exécutera un ensemble de tâches identifiées dans le modèle de tâches-environnement. Le troisième modèle est le modèle d’interaction d’agents. Il est développé lors de la phase de conception. Il décrit les échanges de messages entre les agents. Le quatrième modèle est le modèle d’implémentation. Il est développé lors de la phase d’implémentation. Il décrit l’infrastructure matérielle sur laquelle le SMA va opérer ainsi que l’environnement de développement du SMA. La méthodologie inclut aussi une technique de réorganisation. Cette technique permet de délimiter la frontière du SMA et contrôler, si possible, ses sources d’erreurs. Cette technique doit intégrer trois techniques nécessaires à la conception d’un système tolérant aux fautes : une technique de prévention d’erreurs, une technique de recouvrement d’erreurs, et une technique de tolérance aux fautes. La technique de prévention d’erreurs permet de délimiter la frontière du SMA. La technique de recouvrement d’erreurs permet de proposer une architecture du SMA pour détecter les erreurs. La technique de tolérance aux fautes permet de définir une procédure de réplication d’agents et de tâches dans le SMA pour que le SMA soit tolérant aux fautes. Cette dernière technique, à l’inverse des techniques de tolérance aux fautes existantes, réplique les tâches et les agents et non seulement les agents. Elle permet ainsi de réduire la complexité du système en diminuant le nombre d’agents à répliquer. Résumé iv De même, un agent peut ne pas être en erreur mais la composante matérielle sur laquelle il est exécuté peut ne plus être fonctionnelle. Ce qui constitue une source d’erreurs pour le SMA. Il faudrait alors que le SMA continue à s’exécuter correctement malgrè le disfonctionnement d’une composante. FATMAS fournit alors un support au concepteur du système pour tenir compte de ce type d’erreurs soit en contrôlant les composantes matérielles, soit en proposant une distribution possible des agents sur les composantes matérielles disponibles pour que le disfonctionnement d’une composante matérielle n’affecte pas le fonctionnement du SMA. FATMAS permet d’identifier des sources d’erreurs lors de la phase de conception du système. Cependant, elle ne traite pas des sources d’erreurs de programmation. Ainsi, la technique de réorganization proposée dans ce travail sera validée par rapport aux sources d’erreurs identifiées lors de la phase de conception et provenant de la frontière du SMA. Nous démontrerons formellement que, si une erreur provient d’une composante que le SMA contrôle, le SMA devrait être opérationnel. Cependant, FATMAS ne certifie pas que le futur système sera toujours opérationnel car elle ne traîte pas des erreurs de programmation ou des erreurs causées par son environnement.A multi-agent system (MAS) consists of several agents interacting together. In a MAS, each agent performs several tasks. However, each agent is prone to individual failures so that it can no longer perform its tasks. This can lead the MAS to a failure. Ideally, the MAS should be able to identify the possible sources of failures and try to overcome them in order to continue operating correctly ; we say that it should be fault-tolerant. There are two kinds of sources of failures to an agent : errors originating from the environment with which the agents interacts, and programming exceptions. There are several works on fault-tolerant systems which deals with programming exceptions. However, these techniques does not allow the MAS to identify errors originating from an agent’s environment. In this thesis, we propose a design methodology, called FATMAS, which allows a MAS designer to identify errors originating from agents’ environments. Doing so, the designer can determine the sources of failures it could be able to control and those it could not. Hence, it can determine the errors it can prevent and those it cannot. Consequently, this allows the designer to determine the system’s boundary from its environment. The system boundary is the area within which the decision-taking process of the MAS has power to make things happen, or prevent them from happening.We distinguish between the system’s environment and an agent’s environment. An agent’s environment is characterized by the components (hardware or software) that the agent does not control. However, the system may control some of the agent’s environment components. Consequently, some of the agent’s environment components may not be a part of the system’s environment. The development of a fault-tolerant MAS (FTMAS) requires the use of a methodology to design FTMAS and of a reorganization technique that will allow the MAS designer to identify and control, if possible, different sources of system failure. However, current MAS design methodologies do not integrate such a technique. FATMAS provides four models used to design and implement the target system and a reorganization technique to assist the designer in identifying and controlling different sources of system’s failures. FATMAS also provides a macro process which covers the entire life cycle of the system development as well as several micro processes that guide the designer when developing each model. The macro-process is based on an iterative approach based on a cost/benefit evaluation to help the designer to decide whether to go from one iteration to another. The methodology has three phases : analysis, design, and implementation. The analysis phase develops the task-environment model. This model identifies the different tasks the agents will perform, their resources, and their preconditions. It identifies several possible sources of system failures. The design phase develops the agent model and the agent interaction model. The agent model describes the agents and their resources. Each agent performs several tasks identified in the task-environment model. The agent interaction model describes the messages exchange between agents. The implementation phase develops the implementation model, and allows an automatic code generation of Java agents. The implementation model describes the infrastructure upon which the MAS will operate and the development environment to be used when developing the MAS. The reorganization technique includes three techniques required to design a fault-tolerant system : a fault-prevention technique, a fault-recovery technique, and a fault-tolerance technique. The fault-prevention technique assists the designer in delimiting the system’s boundary. The fault-recovery technique proposes a MAS architecture allowing it to detect failures. The fault-tolerance technique is based on agent and task redundancy. Contrary to existing fault-tolerance techniques, this technique replicates tasks and agents and not only agents. Thus, it minimizes the system complexity by minimizing the number of agents operating in the system. Furthermore, FATMAS helps the designer to deal with possible physical component failures, on which the MAS will operate. It proposes a way to either control these components or to distribute the agents on these components in such a way that if a component is in failure, then the MAS could continue operating properly. The FATMAS methodology presented in this dissertation assists a designer, in its development process, to build fault-tolerant systems. It has the following main contributions : 1. it allows to identify different sources of system failure ; 2. it proposes to introduce new tasks in a MAS to control the identified sources of failures ; 3. it proposes a mechanism which automatically determines which tasks (agents) should be replicated and in which other agents ; 4. it reduces the system complexity by minimizing the replication of agents ; Abstract vii 5. it proposes a MAS reorganization technique which is embedded within the designed MAS and assists the designer to determine the system’s boundary. It proposes a MAS architecture to detect and recover from failures originating from the system boundary. Moreover, it proposes a way to distribute agents on the physical components so that the MAS could continue operating properly in case of a component failure. This could make the MAS more robust to fault prone environments. FATMAS alows to determine different sources of failures of a MAS. The MAS controls the sources of failures situated in its boundary. It does not control the sources of failures situated in its environments. Consequently, the reorganization technique proposed in this dissertation will be proven valid only in the case where the sources of failures are controlled by the MAS. However, it cannot be proven that the future system is fault-tolerant since faults originating from the environment or from coding are not dealt with

    A Transition from Traditional Checkpointing towards Multi-Agent based Approaches

    Full text link
    • …
    corecore