Search CORE

1,093 research outputs found

HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges

Author: Buyya Rajkumar
Calheiros Rodrigo N.
Cunha Renato L. F.
Netto Marco A. S.
Rodrigues Eduardo R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018
Field of study

High Performance Computing (HPC) clouds are becoming an alternative to on-premise clusters for executing scientific applications and business analytics services. Most research efforts in HPC cloud aim to understand the cost-benefit of moving resource-intensive applications from on-premise environments to public cloud platforms. Industry trends show hybrid environments are the natural path to get the best of the on-premise and cloud resources---steady (and sensitive) workloads can run on on-premise resources and peak demand can leverage remote resources in a pay-as-you-go manner. Nevertheless, there are plenty of questions to be answered in HPC cloud, which range from how to extract the best performance of an unknown underlying platform to what services are essential to make its usage easier. Moreover, the discussion on the right pricing and contractual models to fit small and large users is relevant for the sustainability of HPC clouds. This paper brings a survey and taxonomy of efforts in HPC cloud and a vision on what we believe is ahead of us, including a set of research challenges that, once tackled, can help advance businesses and scientific discoveries. This becomes particularly relevant due to the fast increasing wave of new HPC applications coming from big data and artificial intelligence.Comment: 29 pages, 5 figures, Published in ACM Computing Surveys (CSUR

arXiv.org e-Print Archive

Western Sydney ResearchDirect

On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

Author: Donnellan Brian
Forshaw Matthew
Helfert Markus
Krempels Karl-Heinz
McGough A. Stephen
Thomas Nigel
Publication venue
Publication date: 01/01/2014
Field of study

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware and software failures and interruptions from resource owners. With increasing scrutiny of the energy consumption of IT infrastructures, it is important to understand the impact of checkpointing on the energy consumption of HTC environments. In this paper we demonstrate through trace-driven simulation on real-world datasets that existing checkpointing strategies are inadequate at maintaining an acceptable level of energy consumption whilst reducing the makespan of tasks. Furthermore, we identify factors important in deciding whether to employ checkpointing within an HTC environment, and propose novel strategies to curtail the energy consumption of checkpointing approaches

Durham Research Online

Running user-provided virtual machines in batch-oriented computing clusters

Author: Oliveira Vítor
Pina António Manuel Silva
Rocha André
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012
Field of study

The use of virtualization in HPC clusters can provide rich software environments, application isolation and efficient workload management mechanisms, but system-level virtualization introduces a software layer on the computing nodes that reduces performance and inhibits the direct use of hardware devices. We propose an unobtrusive user-level platform that allows the execution of virtual machines inside batch jobs without limiting the computing cluster’s ability to execute the most demanding applications. A per-user platform uses a static mode in which the VMs run entirely using the resources of a single batch job and a dynamic mode in which the VMs navigate at runtime between the continuously allocated jobs node time-slots. A dynamic mode is introduced to build complex scenarios with several VMs for personalized HPC environments or persistent services such as databases or web services based applications. Fault-tolerant system agents, integrated using group communication primitives, control the system and execute user commands and automatic scheduling decisions made by an optional monitoring function. The performance of compute intensive applications running on our system suffers negligible overhead compared to the native configuration. The performance of distributed applications is dependent on their communication patterns as the user-mode network overlay introduces a relevant communication overhead.FC

Universidade do Minho: RepositoriUM

Crossref

Mosix the Cluster Operating System Having Advancements & Many Features

Author: Vijendra Rajendra Augustine , Prof. P. L. Ramteke
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/03/2014
Field of study

Mosix is a running of modifications to the Linux kernel. MOSIX Design Objectives turn a network of Linux computers into a High Performance Cluster computer. The Founder o f MOSIX is the Amnon Barak. MOSIX is a cluster operating system that provides users and applications with the impression of running on a single computer with multiple processors which is called as single - system image and Hide cluster complexity to users. T his paper describes the enhancement of MOSIX to openMosix and its cloud environment. There are many advance features of MOSIX by which large number of appli cation work fastly and properly. Balancing Load is the most effective feature we mentioned it in thi s paper

International Journal on Recent and Innovation Trends in Computing and Communication

SIMDAT

Author: Boniface M.J.
Upstill C.
Publication venue
Publication date: 01/11/2005
Field of study

Southampton (e-Prints Soton)

09191 Abstracts Collection -- Fault Tolerance in High-Performance Computing and Grids

Author: Capello Franck
Kale Laxmikant
Mueller Frank
Pingali Keshav
Reinefeld Alexander
Publication venue: Dagstuhl Seminar Proceedings. 09191 - Fault Tolerance in High-Performance Computing and Grids
Publication date: 01/01/2009
Field of study

From June 4--8, 2009, the Dagstuhl Seminar 09191 ``Fault Tolerance in High-Performance Computing and Grids \u27\u27 was held in Schloss Dagstuhl~--~Leibniz Center for Informatics. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available. Slides of the talks and abstracts are available online at url{http://www.dagstuhl.de/Materials/index.en.phtml?09191}

Dagstuhl Research Online Publication Server