
    A Survey of Fault-Tolerance and Fault-Recovery Techniques in Parallel Systems

    Full text link
    Supercomputing systems today often come in the form of large numbers of commodity systems linked together into a computing cluster. These systems, like any distributed system, can have large numbers of independent hardware components cooperating or collaborating on a computation. Unfortunately, any of this vast number of components can fail at any time, resulting in potentially erroneous output. In order to improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to provide resilience to these kinds of system faults. This survey provides an overview of these various fault-tolerance techniques. Comment: 11 pages
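
    As a hedged illustration of one representative technique that surveys of this kind cover, the sketch below shows application-level checkpoint/restart in Python: the computation periodically persists its state so that, after a component failure, it can resume from the last checkpoint instead of starting over. The file name, state layout, and checkpoint interval are assumptions made for the example, not anything prescribed by the survey.

    import json
    import os

    CHECKPOINT_FILE = "checkpoint.json"   # hypothetical path, for illustration only

    def save_checkpoint(step, partial_result):
        """Persist enough state to resume after a crash."""
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"step": step, "partial_result": partial_result}, f)
        os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: a crash never leaves a torn file

    def load_checkpoint():
        """Return the last saved state, or a fresh one if no checkpoint exists."""
        if os.path.exists(CHECKPOINT_FILE):
            with open(CHECKPOINT_FILE) as f:
                return json.load(f)
        return {"step": 0, "partial_result": 0}

    def run(total_steps=1_000_000, checkpoint_every=10_000):
        state = load_checkpoint()             # resume from the last checkpoint, if any
        acc = state["partial_result"]
        for step in range(state["step"], total_steps):
            acc += step                       # stand-in for the real computation
            if (step + 1) % checkpoint_every == 0:
                save_checkpoint(step + 1, acc)
        return acc

    if __name__ == "__main__":
        print(run())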

    Reducing the cost of group communication with semantic view synchrony

    Get PDF
    View Synchrony (VS) is a powerful abstraction in the design and implementation of dependable distributed systems. By ensuring that processes deliver the same set of messages in each view, it allows them to maintain consistency across membership changes. However, experience indicates that it is hard to combine strong reliability guarantees as offered by VS with stable high performance. In this paper we propose a novel abstraction, Semantic View Synchrony (SVS), that exploits the application's semantics to cope with high throughput applications. This is achieved by allowing some messages to be dropped while still preserving consistency when new views are installed. Thus, SVS inherits the elegance of view synchronous communication. The paper describes how SVS can be implemented and illustrates its usefulness in the context of distributed multi-player games.
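
    A minimal sketch of the idea behind SVS's semantic message dropping, assuming a game-style workload in which only the newest update per semantic key (for example, a player's position) matters; the Update type, the obsolescence rule, and install_view are illustrative names, not the paper's protocol or API. Every surviving process prunes its buffer the same way before the new view is installed, so all of them deliver the same reduced set of messages.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Update:
        key: str        # semantic key, e.g. a player id
        seqno: int      # per-key sequence number
        payload: dict   # e.g. the player's latest position

    def prune_obsolete(buffered):
        """Keep only the newest update per semantic key (the obsolescence relation)."""
        latest = {}
        for u in buffered:
            if u.key not in latest or u.seqno > latest[u.key].seqno:
                latest[u.key] = u
        return sorted(latest.values(), key=lambda u: (u.key, u.seqno))

    def install_view(buffered, new_members):
        """On a membership change, deliver only the pruned set before switching views."""
        to_deliver = prune_obsolete(buffered)
        return {"view": sorted(new_members), "deliver": to_deliver}

    if __name__ == "__main__":
        msgs = [Update("p1", 1, {"x": 0}), Update("p1", 2, {"x": 5}), Update("p2", 1, {"x": 9})]
        print(install_view(msgs, {"node-a", "node-b"}))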

    MSF-Model: Modeling Metastable Failures in Replicated Storage Systems

    Full text link
    Metastable failure is a recent abstraction of a pattern of failures that occurs frequently in real-world distributed storage systems. In this paper, we propose a formal analysis and modeling of metastable failures in replicated storage systems. We focus on a foundational problem in distributed systems -- the problem of consensus -- to have an impact on a large class of systems. Our main contribution is the development of a queuing-based analytical model, MSF-Model, that can be used to characterize and predict metastable failures. MSF-Model integrates novel modeling concepts that allow modeling metastable failures, which was intractable prior to our work. We also perform real experiments to reproduce and validate our model. These experiments show that MSF-Model predicts metastable failures with high accuracy when its predictions are compared against the observed behavior of the real system.
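
    The sketch below is not MSF-Model itself; it is a toy discrete-time queue, with parameters chosen purely for illustration, that shows the kind of dynamics a queuing-based model of metastability has to capture: a short overload pulse builds a backlog, queueing delay then exceeds the client timeout, and the resulting retries keep the offered load above capacity even after the pulse ends.

    def simulate(steps=400, arrival=90, capacity=100, timeout=5,
                 pulse_start=100, pulse_end=130, pulse_arrival=200):
        backlog = 0
        for t in range(steps):
            originals = pulse_arrival if pulse_start <= t < pulse_end else arrival
            delay = backlog / capacity                       # rough queueing delay, in steps
            retries = originals if delay > timeout else 0    # each original also retried once
            backlog = max(0, backlog + originals + retries - capacity)
            if t % 50 == 0:
                print(f"t={t:3d} offered={originals + retries:4d} backlog={backlog:6d}")
        return backlog

    if __name__ == "__main__":
        final = simulate()
        print("final backlog:", final, "- still nonzero long after the pulse")

    Without the retries, the post-pulse load (90) is below capacity (100) and the backlog would drain; with them, the sustaining load (180) keeps the backlog growing even though the original trigger is gone, which is the signature of a metastable failure.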

    Achieving Reliable Parallel Performance in a VoD Storage Server Using Randomization and Replication

    Full text link

    Using Virtualization to Improve Software Rejuvenation

    Full text link
    In this paper, we present an approach for software rejuvenation based on automated self-healing techniques that can be easily applied to off-the-shelf application servers and Internet sites. Software aging and transient failures are detected through continuous monitoring of system data and performability metrics of the application server. If anomalous behavior is identified, the system triggers an automatic rejuvenation action. This self-healing scheme is meant to be as minimally disruptive as possible for the running service and to achieve zero downtime in most cases. In our scheme, we exploit virtualization to optimize the self-recovery actions. The techniques described in this paper have been tested with a set of open-source Linux tools and the Xen virtualization middleware. We conducted an experimental study with two application benchmarks (Tomcat/Axis and TPC-W). Our results demonstrate that virtualization can be extremely helpful for software rejuvenation and fail-over in the presence of transient application failures and software aging.
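
    A minimal sketch of the threshold-driven, virtualization-assisted rejuvenation loop described above, under stated assumptions: the memory probe is simulated, the threshold is arbitrary, and the load-balancer and VM actions are placeholder prints rather than the paper's actual Xen tooling. The point is the control flow: monitor an aging indicator, fail traffic over to a clean standby VM when it degrades, then restart the aged instance.

    import itertools
    import time

    MEMORY_LIMIT_MB = 1500        # assumed aging threshold, for illustration
    CHECK_INTERVAL_S = 0.1        # shortened so the demo finishes quickly

    def simulated_memory_mb(counter=itertools.count(800, 100)):
        """Stand-in for a real probe (e.g. the server's own metrics or psutil)."""
        return next(counter)      # pretends the server leaks ~100 MB per check

    def switch_traffic_to(standby):
        print(f"[rejuvenation] pointing the load balancer at {standby}")  # placeholder action

    def restart_vm(name):
        print(f"[rejuvenation] restarting virtual machine {name}")        # placeholder action

    def monitor(active="vm-app-1", standby="vm-app-2", checks=12):
        for _ in range(checks):
            used = simulated_memory_mb()
            print(f"{active}: {used} MB in use")
            if used > MEMORY_LIMIT_MB:
                switch_traffic_to(standby)          # keep serving requests during cleanup
                restart_vm(active)                  # rejuvenate the aged instance
                active, standby = standby, active   # roles swap for the next cycle
                break
            time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        monitor()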

    Proactive software rejuvenation solution for web environments on virtualized platforms

    Get PDF
    The availability of information technologies for everything, from everywhere, at all times is a growing requirement. We use information technologies for everything from everyday and social tasks to critical tasks such as managing nuclear power plants or even the International Space Station (ISS). However, the availability of IT infrastructures is still a huge challenge nowadays. A quick look at the news turns up reports of corporate outages affecting millions of users and hurting companies' revenue and image. It is well known that computer system outages are currently more often due to software faults than to hardware faults. Several studies have reported that one of the causes of unplanned software outages is the software aging phenomenon. This term refers to the accumulation of errors, usually causing resource contention, during long-running application executions, such as web applications, which normally causes applications or systems to hang or crash. Gradual performance degradation can also accompany software aging. The phenomenon is often related to memory bloating/leaks, unterminated threads, data corruption, unreleased file locks or overruns, and there are several examples of software aging in industry. The work presented in this thesis aims to offer a proactive and predictive software rejuvenation solution for Internet services against software aging caused by resource exhaustion. To this end, we first present a threshold-based proactive rejuvenation approach to avoid the consequences of software aging. This first approach has some limitations, the most important being the need to know a priori the resource or resources involved in the crash and their critical values; moreover, some expertise is needed to set the threshold value that triggers the rejuvenation action. Due to these limitations, we have evaluated the use of Machine Learning to overcome the weaknesses of our first approach and obtain a proactive and predictive solution. Finally, the current and increasing tendency to use virtualization technologies to improve resource utilization has turned traditional data centers into virtualized data centers or platforms. We have used a Mathematical Programming approach to virtual machine allocation and migration to optimize resources, accepting as many services as possible on the platform while, at the same time, guaranteeing the availability of the deployed services (via our software rejuvenation proposal) against the software aging phenomenon. The thesis is supported by an exhaustive experimental evaluation that proves the effectiveness and feasibility of our proposals for current systems.
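
    A minimal sketch of the step from a fixed threshold to a predictive trigger, as described above; the resource trace, limits, and window sizes are assumptions for the example, and the simple linear-trend extrapolation stands in for the thesis's actual Machine Learning models. The idea is to project when the monitored resource will be exhausted and rejuvenate only when that projected time falls below a safety margin.

    def linear_trend(samples):
        """Least-squares slope and intercept over equally spaced samples."""
        n = len(samples)
        xs = range(n)
        mean_x = sum(xs) / n
        mean_y = sum(samples) / n
        denom = sum((x - mean_x) ** 2 for x in xs)
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples)) / denom
        return slope, mean_y - slope * mean_x

    def predicted_steps_to_exhaustion(samples, limit):
        slope, intercept = linear_trend(samples)
        if slope <= 0:
            return float("inf")               # no upward trend: no exhaustion predicted
        current = intercept + slope * (len(samples) - 1)
        return (limit - current) / slope

    def should_rejuvenate(samples, limit_mb=2048, safety_window=10):
        return predicted_steps_to_exhaustion(samples, limit_mb) < safety_window

    if __name__ == "__main__":
        leaking = [900 + 40 * i for i in range(20)]     # simulated leak: +40 MB per check
        print("steps left:", round(predicted_steps_to_exhaustion(leaking, 2048), 1))
        print("rejuvenate now?", should_rejuvenate(leaking))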

    Error management in ATLAS TDAQ: an intelligent systems approach

    Get PDF
    This thesis is concerned with the use of intelligent system techniques (IST) within a large distributed software system, specifically the ATLAS TDAQ system, which has been developed and is currently in use at the European Laboratory for Particle Physics (CERN). The overall aim is to investigate and evaluate a range of IST approaches in order to improve the error management system (EMS) currently used within the TDAQ system via error detection and classification. The thesis work provides a reference for future research and development of such methods in the TDAQ system. The thesis begins by describing the TDAQ system and the existing EMS, with a focus on the underlying expert system approach, in order to identify areas where improvements can be made using IST techniques. It then discusses measures for evaluating error detection and classification techniques and the factors specific to the TDAQ system. Error conditions are then simulated in a controlled manner using an experimental setup, and datasets are gathered from two different sources. Analysis and processing of the datasets using statistical and IST techniques shows that clusters exist in the data corresponding to the different simulated errors. Different IST techniques are applied to the gathered datasets in order to realise an error detection model. These techniques include Artificial Neural Networks (ANNs), Support Vector Machines (SVMs) and Cartesian Genetic Programming (CGP), and a comparison of their respective advantages and disadvantages is made. The principal conclusion from this work is that IST can be successfully used to detect errors in the ATLAS TDAQ system and thus can provide a tool to improve the overall error management system. It is of particular importance that IST can be used without detailed knowledge of the system, as the ATLAS TDAQ system is too complex for a single person to understand completely. The results of this research will benefit researchers developing and evaluating IST techniques in similar large-scale distributed systems.
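
    A minimal sketch of ML-based error classification in the spirit of the thesis, with loud assumptions: the three "monitoring metrics" and the error classes are synthetic stand-ins, not TDAQ data, and a scikit-learn SVM is used here as one of the technique families the thesis compares. It trains on simulated normal and faulty operating points and reports held-out accuracy.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    def synth(n, cpu, queue_len, drop_rate, label):
        """Noisy feature vectors around a nominal operating point (synthetic, for illustration)."""
        feats = rng.normal([cpu, queue_len, drop_rate], [5.0, 10.0, 0.02], size=(n, 3))
        return feats, np.full(n, label)

    parts = [
        synth(300, cpu=40, queue_len=50, drop_rate=0.01, label=0),    # normal operation
        synth(300, cpu=90, queue_len=400, drop_rate=0.02, label=1),   # overload-style error
        synth(300, cpu=35, queue_len=60, drop_rate=0.30, label=2),    # packet-loss-style error
    ]
    X = np.vstack([feats for feats, _ in parts])
    y = np.concatenate([lab for _, lab in parts])

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))   # scale features, then classify
    clf.fit(X_train, y_train)
    print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")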

    Automated Service-Oriented Impact Analysis and Recovery Alternative Selection

    Get PDF