Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment
Fault tolerance is the ability of a system to respond
swiftly to an unexpected failure. Failures in a cloud computing
environment are normal rather than exceptional, but fault
detection and system recovery in a real-time cloud system are
crucial issues. To deal with this problem and to minimize the risk
of failure, an optimal fault tolerance mechanism was introduced
in which fault tolerance is achieved by combining the
Cloud Master, Compute nodes, Cloud load balancer, Selection
mechanism and Cloud Fault handler. In this paper, we propose
an optimized fault tolerance approach in which a model is designed
to tolerate faults based on the reliability of each compute node
(virtual machine); a node can be replaced if its performance is not
optimal. Preliminary tests of our algorithm indicate that the rate
of increase in pass rate exceeds the decrease in failure rate; the
algorithm also considers forward and backward recovery using diverse
software tools. Our results are demonstrated through
experimental validation, laying a foundation for a fully
fault-tolerant IaaS Cloud environment, and suggest good
performance of our model compared to existing
approaches.
Petroleum Technology Development Fund (PTDF
IaaS Cloud como entorno virtual de experimentación en el análisis del checkpoint
Cloud Computing offers computing resources that allow remote access to software, storage and data processing through the Internet.
Infrastructure as a Service (IaaS) is a flexible space that can be used as an experimental environment in which experiments similar to those in a real environment, such as a cluster, can be carried out. Before making installations and changes in a production cluster, or selecting resources in the cloud, it is important to analyze the impact of the change. For this reason we propose using the cloud to carry out a preliminary feasibility study.
In this paper, we examine the viability of using the cloud to analyze the behavior of the checkpoint as one of the fault tolerance strategies, establishing the similarities and differences between the information generated in a real environment (cluster) and in a virtual environment (cloud). The results show that, due to the variability of the cloud, the impact on performance cannot be analyzed. However, the cloud is suitable for extracting the spatial and temporal behavior pattern of the checkpoint, which helps to characterize it.
This characterization will help in choosing the right configuration, taking into account the extra resources needed and the impact depending on the application and the selected resources, and in developing methodologies and tools that simulate and predict the execution of the checkpoint in a real environment.
Facultad de Informátic
RELEASE: A High-level Paradigm for Reliable Large-scale Server Software
Erlang is a functional language with a much-emulated model for building reliable distributed systems. This paper outlines the RELEASE project and describes the progress made in its first six months. The project aims to scale Erlang’s radical concurrency-oriented programming paradigm to build reliable general-purpose software, such as server-based systems, on massively parallel machines. Erlang currently has inherently scalable computation and reliability models, but in practice scalability is constrained by aspects of the language and virtual machine. We are working at three levels to address these challenges: evolving the Erlang virtual machine so that it can work effectively on large-scale multicore systems; evolving the language to Scalable Distributed (SD) Erlang; and developing a scalable Erlang infrastructure to integrate multiple, heterogeneous clusters. We are also developing state-of-the-art tools that allow programmers to understand the behaviour of massively parallel SD Erlang programs. We will demonstrate the effectiveness of the RELEASE approach using demonstrators and two large case studies on a Blue Gene
Fail Over Strategy for Fault Tolerance in Cloud Computing Environment
Cloud fault tolerance is an important issue in cloud computing platforms and applications. In the event of an unexpected
system failure or malfunction, a robust fault-tolerant design may allow the cloud to continue functioning correctly,
possibly at a reduced level, instead of failing completely. To ensure high availability of critical cloud services,
application execution and hardware performance, various fault-tolerant techniques exist for building self-autonomous
cloud systems. In comparison to current approaches, this paper proposes a more robust and reliable architecture using
an optimal checkpointing strategy to ensure high system availability and reduced system task service finish time. Using
pass rates and virtualised mechanisms, the proposed Smart Failover Strategy (SFS) scheme uses components such as a
Cloud fault manager, Cloud controller, Cloud load balancer and a selection mechanism, providing fault tolerance via
redundancy, optimized selection and checkpointing. In our approach, the Cloud fault manager repairs faults generated
before the task time deadline is reached, blocking unrecoverable faulty nodes as well as their virtual nodes. This scheme
is also able to remove temporary software faults from recoverable faulty nodes, thereby making them available for future
requests. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering forward and
backward recovery using diverse software tools. Compared to existing approaches, preliminary experiments with the SFS
algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing overall good
performance in task allocation. We present these results using experimental validation tools with comparison to other
techniques, laying a foundation for a fully fault-tolerant IaaS Cloud environment