Search CORE

8,582 research outputs found

FADI: a fault-tolerant environment for open distributed computing

Author: A. Bargiela
BARGIELA
ELNOZAHY
JANAKIRAMAN
LAMOTTE
OSMAN
OSMAN
PLANK
POWELL
RIBERIO
SENS
SILVA
T. Osman
Publication venue: 'Institution of Engineering and Technology (IET)'
Publication date: 01/01/2000
Field of study

FADI is a complete programming environment that serves the reliable execution of distributed application programs. FADI encompasses all aspects of modern fault-tolerant distributed computing. The built-in user-transparent error detection mechanism covers processor node crashes and hardware transient failures. The mechanism also integrates user-assisted error checks into the system failure model. The nucleus non-blocking checkpointing mechanism combined with a novel selective message logging technique delivers an efficient, low-overhead backup and recovery mechanism for distributed processes. FADI also provides means for remote automatic process allocation on the distributed system nodes

ETS (Efficient, Transparent, and Secured) Self-healing Service for Pervasive Computing Applications

Author: Ahamed Sheikh Iqbal
Ahmed Shameem
Sharmin Moushumi
Publication venue: e-Publications@Marquette
Publication date: 01/05/2007
Field of study

To ensure smooth functioning of numerous handheld devices anywhere anytime, the importance of self-healing mechanism cannot be overlooked. Incorporation of efficient fault detection and recovery in device itself is the quest for long but there is no existing self-healing scheme for devices running in pervasive computing environments that can be claimed as the ultimate solution. Moreover, the highest degree of transparency, security and privacy attainability should also be maintained. ETS Self-healing service, an integral part of our developing middleware named MARKS (Middleware Adaptability for Resource discovery, Knowledge usability, and Self-healing), holds promise for offering all of those functionalities

ISIS and META projects

Author: Birman Kenneth
Cooper Robert
Marzullo Keith
Publication venue
Publication date
Field of study

The ISIS project has developed a new methodology, virtual synchony, for writing robust distributed software. High performance multicast, large scale applications, and wide area networks are the focus of interest. Several interesting applications that exploit the strengths of ISIS, including an NFS-compatible replicated file system, are being developed. The META project is distributed control in a soft real-time environment incorporating feedback. This domain encompasses examples as diverse as monitoring inventory and consumption on a factory floor, and performing load-balancing on a distributed computing system. One of the first uses of META is for distributed application management: the tasks of configuring a distributed program, dynamically adapting to failures, and monitoring its performance. Recent progress and current plans are reported

Programming your way out of the past: ISIS and the META Project

Author: Birman Kenneth P.
Marzullo Keith
Publication venue
Publication date
Field of study

The ISIS distributed programming system and the META Project are described. The ISIS programming toolkit is an aid to low-level programming that makes it easy to build fault-tolerant distributed applications that exploit replication and concurrent execution. The META Project is reexamining high-level mechanisms such as the filesystem, shell language, and administration tools in distributed systems

A FPGA-Based Reconfigurable Software Architecture for Highly Dependable Systems

Author: Di Carlo Stefano
Prinetto Paolo Ernesto
Scionti A.
Publication venue: IEEE Computer Society
Publication date: 01/01/2009
Field of study

Nowadays, systems-on-chip are commonly equipped with reconfigurable hardware. The use of hybrid architectures based on a mixture of general purpose processors and reconfigurable components has gained importance across the scientific community allowing a significant improvement of computational performance. Along with the demand for performance, the great sensitivity of reconfigurable hardware devices to physical defects lead to the request of highly dependable and fault tolerant systems. This paper proposes an FPGA-based reconfigurable software architecture able to abstract the underlying hardware platform giving an homogeneous view of it. The abstraction mechanism is used to implement fault tolerance mechanisms with a minimum impact on the system performanc

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Checkpointing as a Service in Heterogeneous Cloud Environments

Author: Cao Jiajun
Cooperman Gene
Morin Christine
Simonin Matthieu
Publication venue
Publication date: 07/11/2014
Field of study

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

INRIA a CCSD electronic archive server

HAL-Rennes 1