A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems
Supercomputing systems today often come in the form of large numbers of
commodity systems linked together into a computing cluster. These systems, like
any distributed system, can have large numbers of independent hardware
components cooperating or collaborating on a computation. Unfortunately, any of
this vast number of components can fail at any time, resulting in potentially
erroneous output. In order to improve the robustness of supercomputing
applications in the presence of failures, many techniques have been developed
to provide resilience to these kinds of system faults. This survey provides an
overview of these various fault-tolerance techniques. Comment: 11 pages
Giving Neurons to Sensors: An Approach to QoS Management Through Artificial Intelligence in Wireless Networks
Over the last ten years, many authors have focused their investigations
on wireless sensor networks. Various research issues have
been extensively developed: power consumption, MAC protocols, self-organizing
network algorithms, data-aggregation schemes, routing protocols,
QoS management, etc. Due to the constraints on data processing
and power consumption, the use of artificial intelligence has historically
been discarded. However, in some special scenarios the features of
neural networks are well suited to complex tasks such as path
discovery. In this paper, we explore the performance of two very well-known
routing paradigms, directed diffusion and Energy-Aware Routing,
and of our routing algorithm, named SIR, whose novelty is that it is
based on the introduction of neural networks in every sensor node. Extensive
simulations on our wireless sensor network simulator, OLIMPO,
have been carried out to study the efficiency of introducing neural
networks. The results obtained with each routing protocol
are compared and analyzed. This paper attempts to encourage the use of artificial
intelligence techniques in wireless sensor nodes.
Reliability models for HPC applications and a Cloud economic model
With the enormous number of computing resources in HPC and Cloud systems, failures become a major concern. Therefore, failure behaviors such as reliability, failure rate, and mean time to failure need to be understood to manage such a large system efficiently.
This dissertation makes three major contributions in HPC and Cloud studies. First, a reliability model with correlated failures in a k-node system for HPC applications is studied. This model is extended to improve accuracy by accounting for failure correlation. The Marshall-Olkin Multivariate Weibull distribution is improved with the excess-life (conditional Weibull) distribution to better estimate system reliability. Also, a univariate method is proposed for estimating the Marshall-Olkin Multivariate Weibull parameters of a system composed of a large number of nodes. Then, the failure rate and mean time to failure are derived. The model is validated using log data from the Blue Gene/L system at LLNL. Results show that when node failures in the system are correlated, the system becomes less reliable.
Second, a reliability model of Cloud computing is proposed. The reliability, mean time to failure, and failure rate are estimated for a system of k nodes and s virtual machines under four scenarios: 1) hardware components fail independently, and software components fail independently; 2) software components fail independently, and hardware failures are correlated; 3) software failures are correlated, and hardware components fail independently; and 4) software and hardware failures are both dependent. Results show that if the failures of the nodes and/or software in the system possess a degree of dependency, the system becomes less reliable. Also, an increase in the number of computing components decreases the reliability of the system.
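The last observation, that reliability drops as components are added, can be illustrated with a minimal sketch. It assumes the simplest case only: k identical nodes that fail independently with Weibull lifetimes, arranged as a series system in which every node must survive. It is not the dissertation's Marshall-Olkin model, and the function names and the shape and scale values are hypothetical, chosen purely for illustration.

```python
import math


def node_reliability(t, shape, scale):
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def system_reliability(t, k, shape, scale):
    """Series system of k independent, identical nodes: all must survive to time t."""
    return node_reliability(t, shape, scale) ** k


def system_mttf(k, shape, scale):
    """Mean time to failure of the series system.

    The minimum of k i.i.d. Weibull lifetimes is again Weibull with the same
    shape and scale * k ** (-1 / shape), so the mean follows in closed form.
    """
    return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)


if __name__ == "__main__":
    shape, scale = 0.8, 10000.0  # hypothetical per-node parameters (hours)
    for k in (1, 16, 256, 4096):
        r = system_reliability(24.0, k, shape, scale)
        print(k, r, system_mttf(k, shape, scale))
```

Under these assumptions both the 24-hour reliability and the MTTF fall monotonically as k grows. Modeling the correlated scenarios would require a joint lifetime distribution rather than a product of independent survival functions.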
Finally, an economic model for a Cloud service provider is proposed. This economic model aims at maximizing profit based on the right pricing and rightsizing in the Cloud data center. Total cost is a key element in the model and it is analyzed by considering the Total Cost of Ownership (TCO) of the Cloud
Design of an integrated airframe/propulsion control system architecture
The design of an integrated airframe/propulsion control system architecture is described. The design is based on a prevalidation methodology that uses both reliability and performance. A detailed account is given of the testing associated with a subset of the architecture, followed by general observations on applying the methodology to the architecture.