A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

Author: Treaster Michael
Publication date: 31/12/2004

Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages

arXiv.org e-Print Archive

CiteSeerX

Giving Neurons to Sensors: An Approach to QoS Management Through Artificial Intelligence in Wireless Networks

Author: Barbancho Concejero Antonio
Barbancho Concejero Julio
León de Mora Carlos
Molina Cantero Francisco Javier
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2006

Over the last ten years, many authors have focused their investigations on wireless sensor networks. Various research issues have been extensively developed: power consumption, MAC protocols, self-organizing network algorithms, data-aggregation schemes, routing protocols, QoS management, etc. Due to the constraints on data processing and power consumption, the use of artificial intelligence has historically been discarded. However, in some special scenarios the features of neural networks are appropriate for complex tasks such as path discovery. In this paper, we explore the performance of two very well-known routing paradigms, directed diffusion and Energy-Aware Routing, and our routing algorithm, named SIR, whose novelty is the introduction of neural networks in every sensor node. Extensive simulations on our wireless sensor network simulator, OLIMPO, have been carried out to study the efficiency of introducing neural networks. The results obtained with each routing protocol are compared. This paper attempts to encourage the use of artificial intelligence techniques in wireless sensor nodes.

idUS. Depósito de Investigación Universidad de Sevilla

Reliability models for HPC applications and a Cloud economic model

Author: Thanakornworakij Thanadech
Publication venue: Louisiana Tech Digital Commons
Publication date: 01/07/2012

With the enormous number of computing resources in HPC and Cloud systems, failures become a major concern. Therefore, failure behaviors such as reliability, failure rate, and mean time to failure need to be understood to manage such a large system efficiently. This dissertation makes three major contributions in HPC and Cloud studies. First, a reliability model with correlated failures in a k-node system for HPC applications is studied. This model is extended to improve accuracy by accounting for failure correlation. The Marshall-Olkin Multivariate Weibull distribution is improved by excess life (conditional Weibull) to better estimate system reliability. Also, a univariate method is proposed for estimating the Marshall-Olkin Multivariate Weibull parameters of a system composed of a large number of nodes. Then, failure rate and mean time to failure are derived. The model is validated using log data from the Blue Gene/L system at LLNL. Results show that when failures of nodes in the system are correlated, the system becomes less reliable. Second, a reliability model of Cloud computing is proposed. The reliability, mean time to failure, and failure rate are estimated for a system of k nodes and s virtual machines under four scenarios: 1) hardware components fail independently, and software components fail independently; 2) software components fail independently, and hardware components are correlated in failure; 3) correlated software failure and independent hardware failure; and 4) dependent software and hardware failure. Results show that if the failure of the nodes and/or software in the system possesses a degree of dependency, the system becomes less reliable. Also, an increase in the number of computing components decreases the reliability of the system. Finally, an economic model for a Cloud service provider is proposed. This economic model aims at maximizing profit based on the right pricing and rightsizing in the Cloud data center. Total cost is a key element in the model, and it is analyzed by considering the Total Cost of Ownership (TCO) of the Cloud.
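
The abstract's remark that adding computing components lowers system reliability can be illustrated with a much simpler baseline than the dissertation's model. The sketch below assumes k independent nodes with identical Weibull lifetimes and a job that fails as soon as any node fails (a series system). It is not the Marshall-Olkin correlated model described above, and the function names and parameter values are hypothetical.

```python
import math


def series_reliability(t, k, shape, scale):
    """Probability that all k independent Weibull nodes survive to time t.

    Assumption: each node has survival function exp(-(t/scale)**shape),
    so the product over k nodes is exp(-k * (t/scale)**shape).
    """
    return math.exp(-k * (t / scale) ** shape)


def series_mttf(k, shape, scale):
    """Mean time to first failure among k independent Weibull nodes.

    The minimum of k such lifetimes is again Weibull with the same shape
    and scale * k**(-1/shape), whose mean is scale' * Gamma(1 + 1/shape).
    """
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    # Hypothetical parameters: shape 0.7, scale 1000 hours per node.
    for k in (1, 16, 256, 4096):
        print(k, series_reliability(10.0, k, 0.7, 1000.0),
              series_mttf(k, 0.7, 1000.0))
```

Under these assumptions both the reliability at a fixed time and the mean time to failure shrink as k grows, which matches the qualitative claim in the abstract. The effect of failure correlation is not modeled here.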

Louisiana Tech Digital Commons

Design of an integrated airframe/propulsion control system architecture

Author: Cohen Gerald C.
Lee C. William
Strickland Michael J.
Torkelson Thomas C.

The design of an integrated airframe/propulsion control system architecture is described. The design is based on a prevalidation methodology that uses both reliability and performance. A detailed account is given of the testing associated with a subset of the architecture, followed by general observations on applying the methodology to the architecture.

NASA Technical Reports Server