Search CORE

15,750 research outputs found

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Author: Armaroli Lorenzo
D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

Discrete Event Simulation is a widely used technique that is used to model and analyze complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively large. For this reason, Parallel and Distributes Simulation techniques have been proposed to take advantage of multiple execution units which are found in multicore processors, cluster of workstations or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ART\`IS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes; furthermore, FT-GAIA offers some protection against byzantine failures since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units.Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

Author: D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019
Field of study

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has being designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of multiple execution units run in multicore processors, cluster of workstations or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash-failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units.Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Urbino

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Software-based fault-tolerant routing algorithm in multidimensional networks

Author: Alzeidi N.
Fathy M.
Khonsari A.
Ould-Khaoua M.
Rezazad M.
Safaei F.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

Massively parallel computing systems are being built with hundreds or thousands of components such as nodes, links, memories, and connectors. The failure of a component in such systems will not only reduce the computational power but also alter the network's topology. The software-based fault-tolerant routing algorithm is a popular routing to achieve fault-tolerance capability in networks. This algorithm is initially proposed only for two dimensional networks (Suh et al., 2000). Since, higher dimensional networks have been widely employed in many contemporary massively parallel systems; this paper proposes an approach to extend this routing scheme to these indispensable higher dimensional networks. Deadlock and livelock freedom and the performance of presented algorithm, have been investigated for networks with different dimensionality and various fault regions. Furthermore, performance results have been presented through simulation experiments

Crossref

Enlighten

Design and simulation of advanced fault tolerant flight control schemes

Author: Gururajan Srikanth
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2006
Field of study

This research effort describes the design and simulation of a distributed Neural Network (NN) based fault tolerant flight control scheme and the interface of the scheme within a simulation/visualization environment. The goal of the fault tolerant flight control scheme is to recover an aircraft from failures to its sensors or actuators. A commercially available simulation package, Aviator Visual Design Simulator (AVDS), was used for the purpose of simulation and visualization of the aircraft dynamics and the performance of the control schemes.;For the purpose of the sensor failure detection, identification and accommodation (SFDIA) task, it is assumed that the pitch, roll and yaw rate gyros onboard are without physical redundancy. The task is accomplished through the use of a Main Neural Network (MNN) and a set of three De-Centralized Neural Networks (DNNs), providing analytical redundancy for the pitch, roll and yaw gyros. The purpose of the MNN is to detect a sensor failure while the purpose of the DNNs is to identify the failed sensor and then to provide failure accommodation. The actuator failure detection, identification and accommodation (AFDIA) scheme also features the MNN, for detection of actuator failures, along with three Neural Network Controllers (NNCs) for providing the compensating control surface deflections to neutralize the failure induced pitching, rolling and yawing moments. All NNs continue to train on-line, in addition to an offline trained baseline network structure, using the Extended Back-Propagation Algorithm (EBPA), with the flight data provided by the AVDS simulation package.;The above mentioned adaptive flight control schemes have been traditionally implemented sequentially on a single computer. This research addresses the implementation of these fault tolerant flight control schemes on parallel and distributed computer architectures, using Berkeley Software Distribution (BSD) sockets and Message Passing Interface (MPI) for inter-process communication

The Research Repository @ WVU (West Virginia University)

Performance modeling of fault-tolerant circuit-switched communication networks

Author: Alzeidi N.
Fathy M.
Khonsari A.
Ould-Khaoua M.
Safaei F.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

Circuit switching (CS) has been suggested as an efficient switching method for supporting simultaneous communications (such as data, voice, and images) across parallel systems due to its ability to preserve both communication performance and fault-tolerant demands in such systems. In this paper we present an efficient scheme to capture the mean message latency in 2D torus with CS in the presence of faulty components. We have also conducted extensive simulation experiments, the results of which are used to validate the analytical mode

Crossref

Enlighten

rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks

Author: Cavelan Aurelien
Ciorba Florina M.
Mohammed Ali
Publication venue
Publication date: 01/01/2019
Field of study

Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self scheduling (DLS) is used to achieve a balanced load execution of such applications on high performance computing (HPC) systems. Large HPC systems are vulnerable to processors or node failures and perturbations in the availability of resources. Most self-scheduling approaches do not consider fault-tolerant scheduling or depend on failure or perturbation detection and react by rescheduling failed tasks. In this work, a robust dynamic load balancing (rDLB) approach is proposed for the robust self scheduling of independent tasks. The proposed approach is proactive and does not depend on failure or perturbation detection. The theoretical analysis of the proposed approach shows that it is linearly scalable and its cost decrease quadratically by increasing the system size. rDLB is integrated into an MPI DLS library to evaluate its performance experimentally with two computationally intensive scientific applications. Results show that rDLB enables the tolerance of up to (P minus one) processor failures, where P is the number of processors executing an application. In the presence of perturbations, rDLB boosted the robustness of DLS techniques up to 30 times and decreased application execution time up to 7 times compared to their counterparts without rDLB

arXiv.org e-Print Archive

edoc

A Reliable and Cost-Efficient Auto-Scaling System for Web Applications Using Heterogeneous Spot Instances

Author: Buyya Rajkumar
Calheiros Rodrigo N.
Qu Chenhao
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

Cloud providers sell their idle capacity on markets through an auction-like mechanism to increase their return on investment. The instances sold in this way are called spot instances. In spite that spot instances are usually 90% cheaper than on-demand instances, they can be terminated by provider when their bidding prices are lower than market prices. Thus, they are largely used to provision fault-tolerant applications only. In this paper, we explore how to utilize spot instances to provision web applications, which are usually considered availability-critical. The idea is to take advantage of differences in price among various types of spot instances to reach both high availability and significant cost saving. We first propose a fault-tolerant model for web applications provisioned by spot instances. Based on that, we devise novel auto-scaling polices for hourly billed cloud markets. We implemented the proposed model and policies both on a simulation testbed for repeatable validation and Amazon EC2. The experiments on the simulation testbed and the real platform against the benchmarks show that the proposed approach can greatly reduce resource cost and still achieve satisfactory Quality of Service (QoS) in terms of response time and availability

arXiv.org e-Print Archive

Western Sydney ResearchDirect

Automating Fault Tolerance in High-Performance Computational Biological Jobs Using Multi-Agent Approaches

Author: Alexandrov Vassil
McKee Gerard
Varghese Blesson
Publication venue: 'Elsevier BV'
Publication date: 03/03/2014
Field of study

Background: Large-scale biological jobs on high-performance computing systems require manual intervention if one or more computing cores on which they execute fail. This places not only a cost on the maintenance of the job, but also a cost on the time taken for reinstating the job and the risk of losing data and execution accomplished by the job before it failed. Approaches which can proactively detect computing core failures and take action to relocate the computing core's job onto reliable cores can make a significant step towards automating fault tolerance. Method: This paper describes an experimental investigation into the use of multi-agent approaches for fault tolerance. Two approaches are studied, the first at the job level and the second at the core level. The approaches are investigated for single core failure scenarios that can occur in the execution of parallel reduction algorithms on computer clusters. A third approach is proposed that incorporates multi-agent technology both at the job and core level. Experiments are pursued in the context of genome searching, a popular computational biology application. Result: The key conclusion is that the approaches proposed are feasible for automating fault tolerance in high-performance computing systems with minimal human intervention. In a typical experiment in which the fault tolerance is studied, centralised and decentralised checkpointing approaches on an average add 90% to the actual time for executing the job. On the other hand, in the same experiment the multi-agent approaches add only 10% to the overall execution time.Comment: Computers in Biology and Medicin

arXiv.org e-Print Archive

Queen's University Belfast Research Portal

Crossref

University of St. Andrews - Pure

St Andrews Research Repository