1,719 research outputs found

Fault tolerant architectures for integrated aircraft electronics systems

Author: Levitt K. N.
Melliar-Smith P. M.
Schwartz R. L.

Work on possible architectures for future flight control computer systems is described. Topics covered are Ada for fault-tolerant systems, the Network Error-Tolerant System (NETS) architecture, and voting in asynchronous systems.

NASA Technical Reports Server

Fault-Tolerant Adaptive Parallel and Distributed Simulation

Author: Armaroli Lorenzo
D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Discrete Event Simulation is a widely used technique for modeling and analyzing complex systems in many fields of science and engineering. The increasingly large size of simulation models poses a serious computational challenge, since the time needed to run a simulation can be prohibitively long. For this reason, Parallel and Distributed Simulation techniques have been proposed to take advantage of the multiple execution units found in multicore processors, clusters of workstations, or HPC systems. The current generation of HPC systems includes hundreds of thousands of computing nodes and a vast amount of ancillary components. Despite improvements in manufacturing processes, failures of some components are frequent, and the situation will get worse as larger systems are built. In this paper we describe FT-GAIA, a software-based fault-tolerant extension of the GAIA/ARTÌS parallel simulation middleware. FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes; furthermore, FT-GAIA offers some protection against Byzantine failures, since synchronization messages are replicated as well, so that the receiving entity can identify and discard corrupted messages. We provide an experimental evaluation of FT-GAIA on a running prototype. Results show that a high degree of fault tolerance can be achieved, at the cost of a moderate increase in the computational load of the execution units. Comment: Proceedings of the IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2016)
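
The abstract above says that replicated messages let a receiving entity identify and discard corrupted copies. A minimal Python sketch of one way such filtering could work is given here. It is not FT-GAIA's actual interface: the function name `accept_message`, its parameters, and the strict-majority rule are assumptions made for illustration only.

```python
from collections import Counter

def accept_message(copies, num_replicas):
    """Return the payload sent by a strict majority of replicas, or None.

    copies       -- payloads actually received (crashed replicas send nothing)
    num_replicas -- how many replicas of the sender exist (hypothetical parameter)
    """
    if not copies:
        return None
    # Find the most frequently received payload and how often it arrived.
    payload, count = Counter(copies).most_common(1)[0]
    # Accept only if more than half of all replicas agree on it.
    return payload if count > num_replicas // 2 else None
```

With three replicas, one corrupted copy is outvoted by the two matching ones, and a single crashed replica still leaves a majority. If too few matching copies arrive, the sketch delivers nothing rather than guessing.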

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Fault-tolerant computer study

Author: Avizienis A. A.
Ercegovac M. D.
Rennels D. A.

A set of building-block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building-block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building-block circuits are discussed.

NASA Technical Reports Server

Fault Tolerant Adaptive Parallel and Distributed Simulation through Functional Replication

Author: D'Angelo Gabriele
Ferretti Stefano
Marzolla Moreno
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

This paper presents FT-GAIA, a software-based fault-tolerant parallel and distributed simulation middleware. FT-GAIA has been designed to reliably handle Parallel And Distributed Simulation (PADS) models, which are needed to properly simulate and analyze complex systems arising in any kind of scientific or engineering field. PADS takes advantage of the multiple execution units available in multicore processors, clusters of workstations, or HPC systems. However, large computing systems, such as HPC systems that include hundreds of thousands of computing nodes, have to handle frequent failures of some components. To cope with this issue, FT-GAIA transparently replicates simulation entities and distributes them on multiple execution nodes. This allows the simulation to tolerate crash failures of computing nodes. Moreover, FT-GAIA offers some protection against Byzantine failures, since interaction messages among the simulated entities are replicated as well, so that the receiving entity can identify and discard corrupted messages. Results from an analytical model and from an experimental evaluation show that FT-GAIA provides a high degree of fault tolerance, at the cost of a moderate increase in the computational load of the execution units. Comment: arXiv admin note: substantial text overlap with arXiv:1606.0731

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Urbino

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Low latency reconfiguration mechanism for fine-grained processor internal functional units

Author: Nolte Jörg
Segabinazzi Ferreira Raphael
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/11/2019

The drive for performance, low power consumption, and smaller chip area has been reducing the reliability of electronic devices and shortening the time to fault occurrence due to wear-out. Recent research has shown that functional units within processors usually execute different numbers of operations when running programs. Therefore, these units experience different individual wear-out during their lifetime. Most existing schemes for reconfiguring processors in response to fault detection and other processor parameters operate at the level of cores, which is a costly way to achieve redundancy. This paper presents a low-latency (approximately 1 clock cycle) software-controlled mechanism to reconfigure units within processor cores according to predefined parameters. Such reconfiguration capability delivers features like wear-out balancing of processor functional units, configuration of units according to the criticality of tasks running on an operating system, and configurations that gain performance (e.g. parallel execution) when possible. The focus of this paper is to show the implemented low-latency reconfiguration mechanism and highlight its main possible features.

Crossref

Digitales Repositorium der BTU Cottbus – Senftenberg

EnSuRe: Energy & Accuracy Aware Fault-tolerant Scheduling on Real-time Heterogeneous Systems

Author: Adetomi Adewale
Arslan Tughrul
Ehsan Shoaib
Kasap Server
McDonald-Maier Klaus
Saha Sangeet
Zhai Xiaojun
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 28/06/2021

This paper proposes an energy-efficient real-time scheduling strategy called EnSuRe, which (i) executes real-time tasks on low-power primary processors to enhance the system accuracy while maintaining the deadline and (ii) provides reliability against a fixed number of transient faults by selectively executing backup tasks on a high-power backup processor. Simulation results reveal that EnSuRe consumes nearly 25% less energy than existing techniques while satisfying the fault tolerance requirements. EnSuRe is also able to achieve 75% system accuracy with 50% system utilisation. Further, the simulation outcomes are validated on benchmark tasks via a fault injection framework on a Xilinx ZYNQ APSoC heterogeneous dual-core platform.

University of Essex Research Repository

Southampton (e-Prints Soton)

Coventry University Pure Portal

Mixed-mode multicore reliability

Author: Gurindar S. Sohi
Koushik Chakraborty
Philip M. Wells
Publication venue: IEEE Computer Society
Publication date: 01/01/2009

Future processors are expected to observe increasing rates of hardware faults. Using Dual-Modular Redundancy (DMR), two cores of a multicore can be loosely coupled to redundantly execute a single software thread, providing very high coverage of faults from many different sources. This reliability, however, comes at a high price in terms of per-thread IPC and overall system throughput. We make the observation that a user may want to run both applications requiring high reliability, such as financial software, and more fault-tolerant applications requiring high performance, such as media or web software, on the same machine at the same time. Yet a traditional DMR system must fully operate in redundant mode whenever any application requires high reliability. This paper proposes a Mixed-Mode Multicore (MMM), which enables most applications, including the system software, to run with high reliability in DMR mode, while applications that need high performance can avoid the penalty of DMR. Though conceptually simple, two key challenges arise: 1) care must be taken to protect reliable applications from any faults occurring in applications running in high-performance mode, and 2) the desire to execute additional independent software threads for a performance application complicates the scheduling of computation to cores. After solving these issues, an MMM is shown to improve overall system performance, compared to a traditional DMR system, by approximately 2X when one reliable and one performance application are concurrently executing.
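
The dual-modular redundancy idea in the abstract above (execute the same work twice and compare the results) can be illustrated with a small software analogue. This is only a sketch under stated assumptions: the paper describes a hardware mechanism on two cores, whereas here two sequential calls stand in for the two cores, and the name `run_dmr` and the retry-on-mismatch policy are hypothetical, not taken from the paper.

```python
def run_dmr(task, args, max_retries=2):
    """Run task twice per attempt and return the result only if both runs agree.

    DMR can detect a mismatch but cannot tell which copy is wrong,
    so this sketch simply re-executes (an assumed recovery policy).
    """
    for _ in range(max_retries + 1):
        first = task(*args)    # stands in for execution on the first core
        second = task(*args)   # stands in for execution on the second core
        if first == second:
            return first
    raise RuntimeError("redundant executions disagreed on every attempt")
```

Every attempt costs two executions, which mirrors the throughput penalty the abstract attributes to running in redundant mode.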

CiteSeerX

Crossref