
    A proactive fault tolerance framework for high performance computing (HPC) systems in the cloud

    High Performance Computing (HPC) systems have been widely used by scientists and researchers in both industry and university laboratories to solve advanced computation problems. Most advanced computation problems are either data-intensive or computation-intensive and may take hours, days, or even weeks to complete. For example, some traditional HPC computations run on 100,000 processors for weeks. Consequently, traditional HPC systems often require huge capital investments, and scientists and researchers sometimes have to wait in long queues to access these shared, expensive systems. Cloud computing, on the other hand, offers new computing paradigms, capacity, and flexible solutions for both business and HPC applications. Some of the computation-intensive applications that are usually executed on traditional HPC systems can now be executed in the cloud, and the cloud pricing model eliminates the huge up-front capital investment. However, even for cloud-based HPC systems, fault tolerance remains an issue of growing concern. The large number of virtual machines and electronic components, as well as software complexity and overall system reliability, availability, and serviceability (RAS), are factors with which HPC systems in the cloud must contend. The reactive fault tolerance approach of checkpoint/restart, commonly used in HPC systems, does not scale well in the cloud due to resource sharing and distributed system networks. Hence, the need for reliable, fault-tolerant HPC systems is even greater in a cloud environment. In this thesis we present a proactive fault tolerance approach for HPC systems in the cloud that reduces wall-clock execution time, as well as dollar cost, in the presence of hardware failures. We have developed a generic fault tolerance algorithm for HPC systems in the cloud, and further developed a cost model for executing computation-intensive applications on such systems. Our experimental results, obtained from a real cloud execution environment, show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be considerably reduced compared to the checkpoint and redundancy techniques used in traditional HPC systems.
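
    To make the proactive idea concrete, the following minimal Python sketch shows a monitor-predict-migrate loop of the kind the thesis describes. The helper functions and the 0.8 threshold are illustrative assumptions, not the thesis's actual algorithm.

        import random
        import time

        # Hypothetical stand-ins for the monitoring and migration layer.
        def failure_probability(node):
            """Stub predictor; a real system would use sensor, SMART, or ECC data."""
            return random.random()

        def rent_vm():
            """Stub: provision a fresh cloud VM and return its handle."""
            return f"vm-{random.randint(1000, 9999)}"

        def migrate(src, dst):
            print(f"migrating work from {src} to {dst} before {src} fails")

        def proactive_ft_round(nodes, threshold=0.8):
            """One monitoring round: move work off nodes likely to fail soon."""
            for i, node in enumerate(list(nodes)):
                if failure_probability(node) > threshold:
                    spare = rent_vm()      # replacement capacity is elastic in the cloud
                    migrate(node, spare)   # proactive: no rollback, no lost work
                    nodes[i] = spare

        cluster = ["vm-1", "vm-2", "vm-3"]
        for _ in range(3):                 # a few monitoring rounds for demonstration
            proactive_ft_round(cluster)
            time.sleep(0.1)
        print(cluster)

    Unlike reactive checkpoint/restart, the loop never rolls a job back: work is moved off a suspect node while it is still healthy, which is what saves wall-clock time and cost.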

    Proactive cloud service assurance framework for fault remediation in cloud environment

    Cloud resiliency is an important issue in the successful implementation of cloud computing systems. Handling cloud faults proactively, with a suitable remediation technique of minimum cost, is an important requirement for a fault management system. The selection of the best applicable remediation technique is a decision-making problem that considers parameters such as (i) the impact of the remediation technique, (ii) the overhead of the remediation technique, (iii) the severity of the fault, and (iv) the priority of the application. This manuscript proposes an analytical model to measure the effectiveness of a remediation technique for various categories of faults; it further demonstrates the implementation of an efficient fault remediation system using a rule-based expert system. The expert system is designed to compute a utility value for each remediation technique in a novel way and select the best remediation technique from its knowledge base. A prototype was developed for experimentation, and the results show improved availability with less overhead compared to a reactive fault management system.
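
    As an illustration of the utility-based selection, here is a minimal Python sketch; the linear combination of impact, overhead, severity, and priority is an assumed weighting, not the paper's exact model.

        from dataclasses import dataclass

        @dataclass
        class Remediation:
            name: str
            impact: float    # benefit of applying the technique, normalized to [0, 1]
            overhead: float  # cost of applying it (time/resources), in [0, 1]

        def utility(r: Remediation, severity: float, priority: float) -> float:
            """Higher fault severity and application priority reward impact
            and penalize overhead (assumed linear form)."""
            return severity * priority * r.impact - r.overhead

        def select_remediation(candidates, severity, priority):
            """Rule-based core: pick the technique with the highest utility."""
            return max(candidates, key=lambda r: utility(r, severity, priority))

        techniques = [
            Remediation("restart_service", impact=0.6, overhead=0.1),
            Remediation("migrate_vm", impact=0.9, overhead=0.5),
            Remediation("failover_to_replica", impact=0.8, overhead=0.3),
        ]
        print(select_remediation(techniques, severity=0.7, priority=0.9).name)

    With these invented numbers the cheap restart wins despite its lower impact, which is the trade-off the utility value is meant to capture.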

    On Autonomic HPC Clouds

    Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Krakow (Poland), September 10-11, 2015. The long tail of science using HPC facilities is nowadays looking to instantly available HPC Clouds as a viable alternative to the long waiting queues of supercomputing centers. While the name HPC Cloud suggests a Cloud service, the current HPC-as-a-Service is mainly an offer of bare metal, better described as cluster-on-demand. The elasticity and virtualization benefits of Clouds are not exploited by HPC-as-a-Service. In this paper we discuss how the HPC Cloud offer can be improved from a particular point of view: automation. After a reminder of the characteristics of the Autonomic Cloud, we project the requirements and expectations onto what we name Autonomic HPC Clouds. Finally, we point towards the expected results of the latest research and development activities related to the identified topics. The work related to Autonomic HPC Clouds is supported by the European Commission under grant agreement H2020-6643946 (CloudLightning). The CloudLightning project proposal was prepared by eight partner institutions, three of them earlier partners in the COST Action IC1305 NESUS, benefiting from its inputs for the proposal. The section related to Autonomic Clouds is supported by the Romanian UEFISCDI under grant agreement PN-II-ID-PCE-2011-3-0260 (AMICAS).
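
    For context, autonomic management is conventionally structured as a MAPE-K loop (Monitor, Analyze, Plan, Execute over shared Knowledge). The toy Python sketch below shows that generic pattern applied to elastic scaling; it is illustrative only and not CloudLightning code.

        # Generic MAPE-K loop; all component functions are illustrative stubs.
        knowledge = {"target_util": 0.75, "nodes": 4}

        def monitor():
            return {"cpu_util": 0.92}    # stub: read telemetry from the cloud platform

        def analyze(metrics):
            return metrics["cpu_util"] > knowledge["target_util"]   # overloaded?

        def plan(overloaded):
            return {"scale_out": 1} if overloaded else {}

        def execute(actions):
            knowledge["nodes"] += actions.get("scale_out", 0)       # stub: cloud API call

        execute(plan(analyze(monitor())))
        print(knowledge["nodes"])   # 5 -> one node added elastically

    This kind of closed loop is exactly what bare-metal cluster-on-demand offers lack, and what the paper argues an Autonomic HPC Cloud should provide.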

    Checkpointing as a Service in Heterogeneous Cloud Environments

    A non-invasive, cloud-agnostic approach is demonstrated for extending existing cloud platforms to include checkpoint-restart capability. Most cloud platforms currently rely on each application to provide its own fault tolerance. A uniform mechanism within the cloud itself serves two purposes: (a) direct support for long-running jobs, which would otherwise require a custom fault-tolerance mechanism for each application; and (b) the administrative capability to manage an over-subscribed cloud by temporarily swapping out jobs when higher-priority jobs arrive. An advantage of this uniform approach is that it also supports parallel and distributed computations, over both TCP and InfiniBand, thus allowing traditional HPC applications to take advantage of an existing cloud infrastructure. Additionally, an integrated health-monitoring mechanism detects when long-running jobs either fail or incur exceptionally low performance, perhaps due to resource starvation, and proactively suspends the job. The cloud-agnostic feature is demonstrated by applying the implementation to two very different cloud platforms: Snooze and OpenStack. The use of a cloud-agnostic architecture also enables, for the first time, migration of applications from one cloud platform to another.
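
    A rough Python sketch of the health-monitoring idea follows, assuming the checkpointing service exposes per-job checkpoint and suspend hooks; the Job stub and the progress-rate test are illustrative, not the paper's API.

        import time

        class Job:
            """Stub stand-in for a handle on a long-running cloud job."""
            def __init__(self):
                self._progress, self._running = 0.0, True
            def progress(self):
                return self._progress        # fraction of work completed
            def running(self):
                return self._running
            def checkpoint(self):
                print("checkpoint written")  # stub: snapshot all processes
            def suspend(self):
                self._running = False
                print("job suspended, resources reclaimed")

        def monitor(job, min_rate, poll_sec=0.5):
            """Checkpoint and suspend a job whose progress rate stalls,
            e.g. under resource starvation or silent failure."""
            last = job.progress()
            while job.running():
                time.sleep(poll_sec)
                rate = (job.progress() - last) / poll_sec
                last = job.progress()
                if rate < min_rate:
                    job.checkpoint()   # persist state before freeing resources
                    job.suspend()

        monitor(Job(), min_rate=0.01)

    Because the checkpoint is taken by the cloud service rather than the application, the same suspend path serves both fault tolerance and over-subscription management.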

    Usability of Scientific Workflow in Dynamically Changing Environment

    Scientific workflow management systems are mainly data-flow oriented and face several challenges due to the huge amount of data and the required computational capacity, neither of which can be predicted before enactment. Other problems may arise from dynamic access to data storages or other data sources, and from the distributed nature of scientific workflow computational infrastructures (cloud, cluster, grid, HPC), whose status may change even during the execution of a single workflow instance. Many of these failures could be avoided by workflow management systems that provide provenance-based dynamism and adaptivity to the unforeseen scenarios arising during enactment. In our work we summarize and categorize the failures that can arise in a cloud environment during enactment and show the possibility of predicting and avoiding failures with dynamic and provenance support.
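
    As one concrete way such provenance support can drive failure avoidance, the Python sketch below steers a workflow task away from resources whose provenance records show frequent failures; the record layout and the 0.3 threshold are assumptions for illustration.

        from collections import defaultdict

        # (task, resource) -> [succeeded?, ...] harvested from provenance logs
        history = defaultdict(list)
        history[("align_reads", "cloud-A")] = [True, False, False, True, False]
        history[("align_reads", "cloud-B")] = [True, True, True, True]

        def failure_rate(task, resource):
            runs = history[(task, resource)]
            return runs.count(False) / len(runs) if runs else 0.0

        def pick_resource(task, resources, max_fail=0.3):
            """Prefer resources whose provenance shows few failures for this task."""
            safe = [r for r in resources if failure_rate(task, r) <= max_fail]
            return min(safe or resources, key=lambda r: failure_rate(task, r))

        print(pick_resource("align_reads", ["cloud-A", "cloud-B"]))  # cloud-B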

    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Maintaining the health of IT infrastructure components for improved reliability and availability has been a research and innovation topic for many years. Identifying and handling failures is crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information for diagnosing and fixing failures. In this work, we address three essential research dimensions of failures: the need for failure handling in IT infrastructure, the contribution of system-generated logs to failure detection, and the reactive and proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of the existing literature along three prominent aspects: log preprocessing, anomaly and failure detection, and failure prevention. With this coherent review, we (1) establish the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches for anomaly and failure detection (rule-based, correlation-based, and classification-based), and (3) formulate recommendations and further research guidelines for researchers. To the best of the authors' knowledge, this is the first comprehensive literature review of IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and a comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection; it will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem.
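
    A toy Python sketch of the classification branch of such a pipeline, using scikit-learn on invented log lines and labels, shows the preprocess-vectorize-classify flow the review covers; all data here is fabricated for illustration only.

        import re
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        logs = [
            "INFO disk /dev/sda1 mounted",
            "ERROR disk /dev/sda1 I/O failure sector 1049",
            "INFO service nginx started",
            "ERROR out of memory killing process 4242",
        ]
        labels = [0, 1, 0, 1]  # 1 = failure-related

        def preprocess(line):
            """Strip volatile tokens (paths, numbers) so log templates generalize."""
            line = re.sub(r"/\S+", "<PATH>", line)
            return re.sub(r"\d+", "<NUM>", line).lower()

        vec = TfidfVectorizer()
        X = vec.fit_transform(preprocess(l) for l in logs)
        clf = LogisticRegression().fit(X, labels)

        new = preprocess("ERROR disk /dev/sdb2 I/O failure sector 77")
        print(clf.predict(vec.transform([new])))  # [1] -> flagged as failure

    The preprocessing step mirrors the log-template extraction the review discusses: without it, unique paths and IDs would fragment the feature space and hurt detection.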