Search CORE

2,344 research outputs found

Study of fault-tolerant software technology

Author: Broglio C.
Goldberg J.
Hitt E.
Levitt K.
Slivinski T.
Webb J.
Wild C.
Publication venue
Publication date
Field of study

Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance

NASA Technical Reports Server

Collecting and Analyzing Failure Data of Bluetooth Personal Area Networks

Author: CINQUE MARCELLO
COTRONEO DOMENICO
RUSSO STEFANO
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2006
Field of study

This work presents a failure data analysis campaign on Bluetooth Personal Area Networks (PANs) conducted on two kind of heterogeneous testbeds (working for more than one year). The obtained results reveal how failures distribution are characterized and suggest how to improve the dependability of Bluetooth PANs. Specically, we dene the failure model and we then identify the most effective recovery actions and masking strategies that can be adopted for each failure. We then integrate the discovered recovery actions and masking strategies in our testbeds, improving the availability and the reliability of 3.64% (up to 36.6%) and 202% (referred to the Mean Time To Failure), respectively

Archivio della ricerca - Università degli studi di Napoli Federico II

A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

Author: Egwutuoha Ifeanyi Paulinus
Publication venue: Faculty of Engineering and Information Technologies, School of Electrical and Information Engineering
Publication date: 01/01/2014
Field of study

High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive. They may take hours, days or even weeks to complete execution. For example, some of the traditional HPC systems computations run on 100,000 processors for weeks. Consequently traditional HPC systems often require huge capital investments. As a result, scientists and researchers sometimes have to wait in long queues to access shared, expensive HPC systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed in traditional HPC systems can now be executed in the cloud. Cloud computing price model eliminates huge capital investments. However, even for cloud-based HPC systems, fault tolerance is still an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, which is commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed systems networks. Hence, the need for reliable fault tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach to HPC systems in the cloud to reduce the wall-clock execution time, as well as dollar cost, in the presence of hardware failure. We have developed a generic fault tolerance algorithm for HPC systems in the cloud. We have further developed a cost model for executing computation-intensive applications on HPC systems in the cloud. Our experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to checkpoint and redundancy techniques used in traditional HPC systems

Sydney eScholarship

An automated wrapper-based approach to the design of dependable software

Author: Jhumka Arshad
Leeke Matthew
Publication venue
Publication date: 01/01/2011
Field of study

The design of dependable software systems invariably comprises two main activities: (i) the design of dependability mechanisms, and (ii) the location of dependability mechanisms. It has been shown that these activities are intrinsically difficult. In this paper we propose an automated wrapper-based methodology to circumvent the problems associated with the design and location of dependability mechanisms. To achieve this we replicate important variables so that they can be used as part of standard, efficient dependability mechanisms. These well-understood mechanisms are then deployed in all relevant locations. To validate the proposed methodology we apply it to three complex software systems, evaluating the dependability enhancement and execution overhead in each case. The results generated demonstrate that the system failure rate of a wrapped software system can be several orders of magnitude lower than that of an unwrapped equivalent

CiteSeerX

Warwick Research Archives Portal Repository

Secure Virtualization and Multicore Platforms State-of-the-Art report

Author: Douglas Heradon
Gehrmann Christian
Publication venue: Swedish Institute of Computer Science
Publication date: 01/01/2009
Field of study

SVaM

CiteSeerX

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

Separation kernel robustness testing : the xtratum case study

Author: Carrascosa Elena
Crespo Alfons
Grixti Stephen
Hernek Maria
IEEE International Conference on Cluster Computing
Masmano Miguel
Sammut Nicholas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

With time and space partitioned architectures becoming increasingly appealing to the European space sector, the dependability of separation kernel technology is a key factor to its applicability in European Space Agency projects. This paper explores the potential of the data type fault model, which injects faults through the Application Program Interface, in separation kernel robustness testing. This fault injection methodology has been tailored to investigate its relevance in uncovering vulnerabilities within separation kernels and potentially contributing towards fault removal campaigns within this domain. This is demonstrated through a robustness testing case study of the XtratuM separation kernel for SPARC LEON3 processors. The robustness campaign exposed a number of vulnerabilities in XtratuM, exhibiting the potential benefits of using such a methodology for the robustness assessment of separation kernels.peer-reviewe

OAR@UM

Crossref