Search CORE

47,425 research outputs found

Checkpointing as a Service in Heterogeneous Cloud Environments

Author: Cao Jiajun
Cooperman Gene
Morin Christine
Simonin Matthieu
Publication venue
Publication date: 07/11/2014
Field of study

A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerant mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.Comment: 20 pages, 11 figures, appears in CCGrid, 201

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

HAL-Rennes 1

HTC Scientific Computing in a Distributed Cloud Environment

Author: Agarwal A.
Charbonneau A.
Gable I.
Impey R.
Leavett-Brown C.
Paterson M.
Podiama W.
Sobie R.
Taylor R.
Publication venue
Publication date: 01/01/2013
Field of study

This paper describes the use of a distributed cloud computing system for high-throughput computing (HTC) scientific applications. The distributed cloud computing system is composed of a number of separate Infrastructure-as-a-Service (IaaS) clouds that are utilized in a unified infrastructure. The distributed cloud has been in production-quality operation for two years with approximately 500,000 completed jobs where a typical workload has 500 simultaneous embarrassingly-parallel jobs that run for approximately 12 hours. We review the design and implementation of the system which is based on pre-existing components and a number of custom components. We discuss the operation of the system, and describe our plans for the expansion to more sites and increased computing capacity

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

Crossref

High-Performance Cloud Computing: A View of Scientific Applications

Author: Buyya Rajkumar
Pandey Suraj
Vecchiola Christian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2009
Field of study

Scientific computing often requires the availability of a massive number of computers for performing large scale experiments. Traditionally, these needs have been addressed by using high-performance computing solutions and installed facilities such as clusters and super computers, which are difficult to setup, maintain, and operate. Cloud computing provides scientists with a completely new model of utilizing the computing infrastructure. Compute resources, storage resources, as well as applications, can be dynamically provisioned (and integrated within the existing infrastructure) on a pay per use basis. These resources can be released when they are no more needed. Such services are often offered within the context of a Service Level Agreement (SLA), which ensure the desired Quality of Service (QoS). Aneka, an enterprise Cloud computing solution, harnesses the power of compute resources by relying on private and public Clouds and delivers to users the desired QoS. Its flexible and service based infrastructure supports multiple programming paradigms that make Aneka address a variety of different scenarios: from finance applications to computational science. As examples of scientific computing in the Cloud, we present a preliminary case study on using Aneka for the classification of gene expression data and the execution of fMRI brain imaging workflow.Comment: 13 pages, 9 figures, conference pape

arXiv.org e-Print Archive

CiteSeerX

Crossref

SLA-Oriented Resource Provisioning for Cloud Computing: Challenges, Architecture, and Solutions

Author: Buyya Rajkumar
Calheiros Rodrigo N.
Garg Saurabh Kumar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

Cloud computing systems promise to offer subscription-oriented, enterprise-quality computing services to users worldwide. With the increased demand for delivering services to a large number of users, they need to offer differentiated services to users and meet their quality expectations. Existing resource management systems in data centers are yet to support Service Level Agreement (SLA)-oriented resource allocation, and thus need to be enhanced to realize cloud computing and utility computing. In addition, no work has been done to collectively incorporate customer-driven service management, computational risk management, and autonomic resource management into a market-based resource management system to target the rapidly changing enterprise requirements of Cloud computing. This paper presents vision, challenges, and architectural elements of SLA-oriented resource management. The proposed architecture supports integration of marketbased provisioning policies and virtualisation technologies for flexible allocation of resources to applications. The performance results obtained from our working prototype system shows the feasibility and effectiveness of SLA-based resource provisioning in Clouds.Comment: 10 pages, 7 figures, Conference Keynote Paper: 2011 IEEE International Conference on Cloud and Service Computing (CSC 2011, IEEE Press, USA), Hong Kong, China, December 12-14, 201

arXiv.org e-Print Archive

Crossref

Western Sydney ResearchDirect

A support architecture for reliable distributed computing systems

Author: Mckendry Martin S.
Publication venue
Publication date
Field of study

The Clouds kernel design was through several design phases and is nearly complete. The object manager, the process manager, the storage manager, the communications manager, and the actions manager are examined

NASA Technical Reports Server