Independent Checkpointing in a Heterogeneous Grid Environment

Author: Feller Eugen
Mehnert-Spahn John
Morin Christine
Schoettner Michael
Publication venue: HAL CCSD
Publication date: 27/09/2010

The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying grid-node checkpointers (e.g. BLCR, LinuxSSI, OpenVZ) transparently through a uniform interface. In this paper, we present the integration of an independent checkpointing and rollback-recovery protocol into XtreemGCP. The solution we propose is not bound to a particular checkpointer and can therefore be used transparently on top of any grid-node checkpointer. To evaluate the prototype, we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and to independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

INRIA a CCSD electronic archive server

Checkpointing as a Service in Heterogeneous Cloud Environments

Author: Cao Jiajun
Cooperman Gene
Morin Christine
Simonin Matthieu
Publication date: 07/11/2014

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.

Comment: 20 pages, 11 figures, appears in CCGrid, 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

An Extensible Timing Infrastructure for Adaptive Large-scale Applications

Author: Allen Gabrielle
Goodale Tom
Radke Thomas
Schnetter Erik
Stark Dylan
Publication date: 01/01/2007

Real-time access to accurate and reliable timing information is necessary to profile scientific applications, and crucial as simulations become increasingly complex, adaptive, and large-scale. The Cactus Framework provides flexible and extensible capabilities for timing information through a well-designed infrastructure and timing API. Applications built with Cactus automatically gain access to built-in timers, such as gettimeofday and getrusage, system-specific hardware clocks, and high-level interfaces such as PAPI. We describe the Cactus timer interface, its motivation, and its implementation. We then demonstrate how this timing information can be used by an example scientific application to profile itself, and to dynamically adapt itself to a changing environment at run time.

arXiv.org e-Print Archive

MPG.PuRe

Distributed Preemptive Process Management With Checkpointing And Migration For A Linux-Based Grid Operating System

Author: Htin Paw Oo @ Nur Hussein
Publication date: 01/06/2006

The advent of grid computing has enabled distributed computing resources to be shared amongst participants of virtual organisations. However, current operating systems do not provide enough low-level facilities to accommodate grid software. There is an emerging class of operating systems called grid operating systems, which provide systems-level abstractions for grid resources.

Repository@USM