Search CORE

3,213 research outputs found

Checkpointing as a Service in Heterogeneous Cloud Environments

Author: Cao Jiajun
Cooperman Gene
Morin Christine
Simonin Matthieu
Publication venue
Publication date: 07/11/2014
Field of study

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

Recommended from our members

Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment

Author: Awan Irfan U.
Kiran Mariam
Maiyama Kabiru M.
Mohammed Bashir
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/08/2016
Field of study

YesFault tolerance is the ability of a system to respond swiftly to an unexpected failure. Failures in a cloud computing environment are normal rather than exceptional, but fault detection and system recovery in a real time cloud system is a crucial issue. To deal with this problem and to minimize the risk of failure, an optimal fault tolerance mechanism was introduced where fault tolerance was achieved using the combination of the Cloud Master, Compute nodes, Cloud load balancer, Selection mechanism and Cloud Fault handler. In this paper, we proposed an optimized fault tolerance approach where a model is designed to tolerate faults based on the reliability of each compute node (virtual machine) and can be replaced if the performance is not optimal. Preliminary test of our algorithm indicates that the rate of increase in pass rate exceeds the decrease in failure rate and it also considers forward and backward recovery using diverse software tools. Our results obtained are demonstrated through experimental validation thereby laying a foundation for a fully fault tolerant IaaS Cloud environment, which suggests a good performance of our model compared to current existing approaches.Petroleum Technology Development Fund (PTDF

Bradford Scholars

IaaS Cloud como entorno virtual de experimentación en el análisis del checkpoint

Author: Franco Daniel
Gomez-Sanchez Pilar
León Betzabeth
Luque Enrique
Rexachs del Rosario Dolores
Publication venue
Publication date: 19/12/2019
Field of study

Cloud Computing offers the possibility of computing resources, allowing remote access to software, storage and data processing through the Internet. Infrastructures as a Service (IaaS), it is a flexible space which can be used as an experimental environment, in which experiments can be carried out similar to a real environment, such as in a cluster can be carried out. Before making installations and changes in a production cluster or select resource in the cloud, it is important to analyze the impact of this change. For this reason we propose using the cloud to carry out the study of previous viability. In this paper, we observe the viability of using the cloud to analyze the behavior of the Checkpoint as one of the Fault Tolerance strategies, establishing the differences that exist in the information generated in a real environment (cluster) and a virtual environment (cloud). The results obtained show that due to the variability of the cloud, the impact on the benefits cannot be analyzed. However, the cloud is suitable for extracting the spatial and temporal behavior pattern of the checkpoint, which helps to characterize it and this will help us to know the right configuration and the development of methodologies and tools that simulate and predict the execution of the checkpoint in a real environment.El Cloud Computing ofrece la posibilidad de recursos informáticos, que permiten el acceso remoto a software, almacenamiento y procesamiento de datos a través de Internet; IaaS, es un espacio flexible que se puede utilizar como un entorno experimental, en el que se pueden llevar a cabo experimentos similares a los de un entorno real, como un clúster. Antes de realizar instalaciones y cambios en un clúster de producción o de seleccionar recursos en el cloud, es importante analizar el impacto de este cambio. Por este motivo se propone utilizar la nube para realizar el estudio de viabilidad previa. En este documento observamos la posibilidad de utilizar la nube para analizar el comportamiento del checkpoint como una de las estrategias de tolerancia a fallos, estableciendo las similitudes y diferencias que existen en la información generada en un entorno real (clúster) y un entorno virtual (nube). Los resultados obtenidos muestran que, debido a la variabilidad de la nube, no se puede analizar el impacto en las prestaciones, pero la nube es adecuada para extraer el patrón de comportamiento espacial y temporal del checkpoint. Caracterizar el comportamiento del checkpoint ayudará a configurar el sistema, teniendo en cuenta los recursos extra que se necesitan y el impacto en función de la aplicación y los recursos seleccionados.Facultad de Informátic

IaaS Cloud como entorno virtual de experimentación en el análisis del checkpoint

Author: Franco Daniel
Gomez-Sanchez Pilar
León Betzabeth
Luque Enrique
Rexachs del Rosario Dolores
Publication venue
Publication date: 01/10/2019
Field of study

RELEASE: A High-level Paradigm for Reliable Large-scale Server Software

Author: A. Leung
C. Hewitt
D. Dewolfs
D. Ungar
G. Agha
G. Germain
H. Rajan
J. Zhao
K. Sagonas
L. Seiler
M. Snir
R. Chandra
R.K. Karmani
S. Srinivasan
T. Arts
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project, and describes the progress in the first six months. The project aim is to scale the Erlang’s radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Currently Erlang has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state of the art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene

CiteSeerX

Crossref

Kent Academic Repository

Fail Over Strategy for Fault Tolerance in Cloud Computing Environment

Author: Agbaria
Alshareef
Amoon
Bala
Bertolli
Bilal
Bin
Bin Hong
Chen
Chen
Chtepen
Elliott
Fu
Greenberg
Jung
Kaur
Kim
Malik
Maloney
Nazari Cheraghlou
Okorafor
Pantic
Paul
Pei
Qiang
Salehi
Sen
Sheng
Singh
Singh
Siva Sathya
Sun
Publication venue: 'Wiley'
Publication date: 05/04/2017
Field of study

YesCloud fault tolerance is an important issue in cloud computing platforms and applications. In the event of an unexpected system failure or malfunction, a robust fault-tolerant design may allow the cloud to continue functioning correctly possibly at a reduced level instead of failing completely. To ensure high availability of critical cloud services, the application execution and hardware performance, various fault tolerant techniques exist for building self-autonomous cloud systems. In comparison to current approaches, this paper proposes a more robust and reliable architecture using optimal checkpointing strategy to ensure high system availability and reduced system task service finish time. Using pass rates and virtualised mechanisms, the proposed Smart Failover Strategy (SFS) scheme uses components such as Cloud fault manager, Cloud controller, Cloud load balancer and a selection mechanism, providing fault tolerance via redundancy, optimized selection and checkpointing. In our approach, the Cloud fault manager repairs faults generated before the task time deadline is reached, blocking unrecoverable faulty nodes as well as their virtual nodes. This scheme is also able to remove temporary software faults from recoverable faulty nodes, thereby making them available for future request. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering forward and backward recovery using diverse software tools. Compared to existing approaches, preliminary experiment of the SFS algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing an overall good performance in task allocations. We present these results using experimental validation tools with comparison to other techniques, laying a foundation for a fully fault tolerant IaaS Cloud environment

Crossref

eScholarship - University of California

Bradford Scholars