33,929 research outputs found

Fault-tolerant distributed computing scheme based on erasure codes

Author: Lacan Jérôme
Publication date: 01/06/2006

Some emerging classes of distributed computing systems, such as peer-to-peer or grid computing systems, are composed of heterogeneous and potentially unreliable computing resources. This paper proposes using erasure codes to improve the fault tolerance of parallel distributed computing applications in this context. A general method to generate redundant processes from a set of parallel processes is presented. This scheme allows the result of the application to be recovered even if some of the processes crash.

Open Archive Toulouse Archive Ouverte

Computing in the RAIN: a reliable array of independent nodes

Author: Bohossian Vasken
Bruck Jehoshua
Fan Chenggong C.
LeMahieu Paul S.
Riedel Marc D.
Xu Lihao
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.

CiteSeerX

Caltech Authors

Dynamic fault tolerant grid workflow in the water threat management project

Author: Moon Young Suk
Publication venue: RIT Scholar Works
Publication date: 01/01/2010

Achieving fault tolerance is an unavoidable problem in distributed systems, and it becomes more challenging in decentralized, heterogeneous, and dynamic environments such as a Grid. When applications are time-critical, allocating resources for jobs in a fault-tolerant manner is an important issue for service delivery. The Water Threat Management project is a research effort to find solutions to contamination incidents in urban water distribution systems, and it involves the development of cyberinfrastructure in a Grid environment. To handle such urgent events properly, the deployed system demands real-time processing without failure. Our approach of integrating a fault-tolerant framework into a Water Threat Management system provides fault tolerance at the queuing stage rather than the job-execution stage by scheduling jobs in fault-tolerant ways. This includes the development of the batch queuing system in the Cyberaide Shell project. In addition, we present a dynamic workflow in the Water Threat Management system that can reduce the queue wait time in a changing environment.

RIT Scholar Works

Fault Tolerant Real Time Dynamic Scheduling Algorithm For Heterogeneous Distributed System

Author: Ekka A A
Publication date: 01/01/2007

Fault tolerance is an important key to establishing dependability in Real Time Distributed Systems (RTDS). In fault-tolerant real-time distributed systems, fault detection and recovery should be executed in a timely manner so that, in spite of fault occurrences, the intended output of real-time computations is always produced on time. Hardware and software redundancy are well-known effective methods for fault tolerance, where extra hardware (e.g., processors, communication links) and software (e.g., tasks, messages) are added to the system to deal with faults. The performance of an RTDS is mostly governed by the efficiency of its scheduling algorithm, and schedulability analysis is performed on the system to ensure that the timing constraints are met. This thesis examines the scenarios where a real-time system requires very little redundant hardware to tolerate failures in heterogeneous real-time distributed systems with point-to-point communication links. Fault tolerance can be achieved by..

ethesis@nitr

Integrated Design Tools for Embedded Control Systems

Author: Broenink Jan F.
Hilderink Gerald H.
Jovanovic Dusko S.
Publication venue: STW Technology Foundation
Publication date: 01/01/2001

Currently, computer-based control systems are still being implemented using the same techniques as 10 years ago. The purpose of this project is the development of a design framework, consisting of tools and libraries, which allows the designer to build highly reliable heterogeneous real-time embedded systems in a very short time at a fraction of present-day costs. The ultimate focus of current research is on transforming control laws into efficient concurrent algorithms, with attention to important non-functional demands of real-time control systems, such as fault tolerance, safety, reliability, etc. The approach is based on a software implementation of CSP process algebra, in a modern way (pure object-oriented design in Java). Furthermore, it is intended that the tool will support the desirable system-engineering stepwise refinement design approach, relying on past research achievements, namely the mechatronics design trajectory based on the building-blocks approach, which covers all complex (mechatronics) engineering phases: physical system modeling, control law design, embedded control system implementation and real-life realization. Therefore, we expect that this project will result in an adequate tool, with results applicable to a wide range of target hardware platforms, based on common (off-the-shelf) distributed heterogeneous (cheap) processing units.

University of Twente Research Information

Challenging Anti-fragile Blockchain Applications

Author: González Miguel Alejandro
Monperrus Martin
Rouvoy Romain
Rudametkin Walter
Publication venue: HAL CCSD
Publication date: 23/04/2017

Failures in production are a de facto rule for distributed software systems. In particular, modern distributed systems are composed of heterogeneous building blocks contributed by third parties, and guaranteeing end-to-end resilience is becoming a major challenge. Even though each of these software components can embed fault tolerance or dependability protocols, it remains difficult to assess their effectiveness upon the occurrence of unexpected failures. As part of this work, we propose a new generation of fault injection framework that can be deployed in production to challenge Blockchain-based distributed systems. This paper therefore reports on the state of the art in this area and potential opportunities for novel contributions towards building anti-fragile distributed systems on the Blockchain.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

A model-based approach for automatic recovery from memory leaks in enterprise applications

Author: Wang Zimin
Publication venue: Scholars Junction
Publication date: 06/08/2011

Large-scale distributed computing systems such as data centers are hosted on heterogeneous, networked servers that execute in a dynamic and uncertain operating environment, caused by factors such as time-varying user workload and various failures. Therefore, achieving stringent quality-of-service goals is a challenging task, requiring a comprehensive approach to performance control, fault diagnosis, and failure recovery. This work presents a model-based approach for fault management, which integrates limited lookahead control (LLC), diagnosis, and fault-tolerance concepts to: (1) enable systems to adapt to environment variations, (2) maintain the availability and reliability of the system, and (3) facilitate system recovery from failures. We focus on memory leak errors in this thesis. A characterization function is designed to detect memory leaks. Then, an LLC is applied to enable the computing system to adapt efficiently to variations in the workload, to recover from memory leaks, and to maintain functionality.

Scholars Junction - Mississippi State University Institutional Repository

A language and toolkit for the specification, execution and monitoring of dependable distributed applications

Author: Ranno Frederic
Publication venue: Newcastle University
Publication date: 01/01/1998

PhD Thesis. This thesis addresses the problem of specifying the composition of distributed applications out of existing applications, possibly legacy ones. With the automation of business processes on the increase, more and more applications of this kind are being constructed. The resulting applications can be quite complex, are usually long-lived, and are executed in a heterogeneous environment. In a distributed environment, long-lived activities need support for fault tolerance and dynamic reconfiguration. Indeed, it is likely that the environment where they are run will change (nodes may fail, services may be moved elsewhere or withdrawn) during their execution and the specification will have to be modified. There is also a need for modularity, scalability and openness. However, most of the existing systems only consider part of these requirements. A new area of research, called workflow management, has been trying to address these issues. This work first looks at what needs to be addressed to support the specification and execution of these new applications in a heterogeneous, distributed environment. A co-ordination language (scripting language) is developed that fulfils the requirements of specifying the composition and inter-dependencies of distributed applications with the properties of dynamic reconfiguration, fault tolerance, modularity, scalability and openness. The architecture of the overall workflow system and its implementation are then presented. The system has been implemented as a set of CORBA services and the execution environment is built using a transactional workflow management system. Next, the thesis describes the design of a toolkit to specify, execute and monitor distributed applications. The design of the co-ordination language and the toolkit represents the main contribution of the thesis. UK Engineering and Physical Sciences Research Council, CaberNet, Northern Telecom (Nortel)

Newcastle University eTheses

Checkpointing as a Service in Heterogeneous Cloud Environments

Author: Cao Jiajun
Cooperman Gene
Morin Christine
Simonin Matthieu
Publication date: 07/11/2014

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another. Comment: 20 pages, 11 figures, appears in CCGrid, 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1