Independent Checkpointing in a Heterogeneous Grid Environment
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. To provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support different checkpointing protocols and to address the underlying grid-node checkpointers (e.g. BLCR, LinuxSSI, OpenVZ) transparently through a uniform interface. In this paper, we present the integration of an independent checkpointing and rollback-recovery protocol into XtreemGCP. The solution we propose is not checkpointer-bound and can therefore be used transparently on top of any grid-node checkpointer. To evaluate the prototype, we run it in a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and to independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.
Checkpointing as a Service in Heterogeneous Cloud Environments
A non-invasive, cloud-agnostic approach is demonstrated for extending
existing cloud platforms to include checkpoint-restart capability. Most cloud
platforms currently rely on each application to provide its own fault
tolerance. A uniform mechanism within the cloud itself serves two purposes: (a)
direct support for long-running jobs, which would otherwise require a custom
fault-tolerant mechanism for each application; and (b) the administrative
capability to manage an over-subscribed cloud by temporarily swapping out jobs
when higher priority jobs arrive. An advantage of this uniform approach is that
it also supports parallel and distributed computations, over both TCP and
InfiniBand, thus allowing traditional HPC applications to take advantage of an
existing cloud infrastructure. Additionally, an integrated health-monitoring
mechanism detects when long-running jobs either fail or incur exceptionally low
performance, perhaps due to resource starvation, and proactively suspends the
job. The cloud-agnostic feature is demonstrated by applying the implementation
to two very different cloud platforms: Snooze and OpenStack. The use of a
cloud-agnostic architecture also enables, for the first time, migration of
applications from one cloud platform to another.
Comment: 20 pages, 11 figures, appears in CCGrid, 201
An Extensible Timing Infrastructure for Adaptive Large-scale Applications
Real-time access to accurate and reliable timing information is necessary to
profile scientific applications, and crucial as simulations become increasingly
complex, adaptive, and large-scale. The Cactus Framework provides flexible and
extensible capabilities for timing information through a well-designed
infrastructure and timing API. Applications built with Cactus automatically
gain access to built-in timers, such as gettimeofday and getrusage,
system-specific hardware clocks, and high-level interfaces such as PAPI. We
describe the Cactus timer interface, its motivation, and its implementation. We
then demonstrate how this timing information can be used by an example
scientific application to profile itself, and to dynamically adapt itself to a
changing environment at run time.
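As a rough illustration of the kind of timer the abstract describes, the sketch below accumulates wall-clock time and user CPU time (via getrusage) under a named timer. It is not the Cactus timer API: the class `Timer`, its methods, and the helper `work` are hypothetical names chosen for this example, and Python's `time.time()` stands in for a gettimeofday-style clock. It assumes a Unix-like system, since the `resource` module is not available on Windows.

```python
import resource
import time


class Timer:
    """Hypothetical named timer accumulating wall-clock and user CPU time."""

    def __init__(self, name):
        self.name = name
        self.wall = 0.0      # accumulated wall-clock seconds
        self.cpu = 0.0       # accumulated user CPU seconds
        self.calls = 0       # number of completed start/stop pairs
        self._start = None   # (wall, cpu) readings taken at start()

    def start(self):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        self._start = (time.time(), usage.ru_utime)

    def stop(self):
        if self._start is None:
            raise RuntimeError(f"timer {self.name!r} was not started")
        usage = resource.getrusage(resource.RUSAGE_SELF)
        wall0, cpu0 = self._start
        self.wall += time.time() - wall0
        self.cpu += usage.ru_utime - cpu0
        self.calls += 1
        self._start = None


def work(n):
    """Stand-in for one step of a simulation (hypothetical workload)."""
    return sum(i * i for i in range(n))


# Time three iterations of the workload under one named timer.
timer = Timer("evolve")
for _ in range(3):
    timer.start()
    work(50_000)
    timer.stop()
```

An application could read `timer.wall` and `timer.cpu` between iterations and adjust its own parameters accordingly, which is the self-profiling pattern the abstract refers to.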
Distributed Preemptive Process Management With Checkpointing And Migration For A Linux-Based Grid Operating System
The advent of grid computing has enabled distributed computing resources to be shared
amongst participants of virtual organisations. However, current operating systems do not
provide adequate low-level facilities to accommodate grid software. There is an
emerging class of operating systems called grid operating systems which provide systems-level
abstractions for grid resources.