3 research outputs found

Checkpointing Strategies for Scheduling Computational Workflows

Author: Aupy Guillaume
Benoit Anne
Casanova Henri
Robert Yves
Publication venue: Higashi Hiroshima : Dept. of Computer Engineering, Hiroshima University
Publication date: 01/01/2016

We study the scheduling of computational workflows on compute resources that experience exponentially distributed failures. When a failure occurs, rollback and recovery is used to resume the execution from the last checkpointed state. The scheduling problem is to minimize the expected execution time by deciding in which order to execute the tasks in the workflow and deciding for each task whether to checkpoint it or not after it completes. We give a polynomial-time optimal algorithm for fork DAGs (Directed Acyclic Graphs) and show that the problem is NP-complete with join DAGs. We also investigate the complexity of the simple case in which no task is checkpointed. Our main result is a polynomial-time algorithm to compute the expected execution time of a workflow, with a given task execution order and specified to-be-checkpointed tasks. Using this algorithm as a basis, we propose several heuristics for solving the scheduling problem. We evaluate these heuristics for representative workflow configurations.
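As a rough illustration of what "expected execution time with specified to-be-checkpointed tasks" means, the sketch below handles only the simplest case, a linear chain of tasks, and is not the paper's algorithm for general DAGs. It assumes the closed form commonly used in the checkpointing literature for exponentially distributed failures of rate `rate`: a segment of work `W` followed by a checkpoint of cost `C`, restarted after a downtime `D` and a recovery of cost `R`, takes e^(rate*R) * (1/rate + D) * (e^(rate*(W+C)) - 1) in expectation (failures may strike during recovery but not during downtime). All function and parameter names are hypothetical.

```python
import math


def expected_segment_time(work, checkpoint, recovery, downtime, rate):
    """Expected time to complete `work` plus a trailing `checkpoint`.

    Assumes exponential failures with the given rate; after each failure
    the platform is down for `downtime`, then spends `recovery` reloading
    the previous checkpoint before retrying the whole segment.
    """
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + checkpoint)))


def expected_chain_time(weights, ckpt_costs, rec_costs, checkpointed,
                        downtime, rate):
    """Expected makespan of a linear chain with a fixed checkpoint set.

    Tasks run in list order. checkpointed[i] says whether task i is
    checkpointed once it completes. Because failures are memoryless, the
    total is the sum of the expectations of the segments delimited by
    checkpoints.
    """
    total = 0.0
    pending_work = 0.0
    recovery = 0.0  # assumption: restarting from the very start is free
    last_index = len(weights) - 1
    for i, weight in enumerate(weights):
        pending_work += weight
        if checkpointed[i] or i == last_index:
            cost = ckpt_costs[i] if checkpointed[i] else 0.0
            total += expected_segment_time(pending_work, cost, recovery,
                                           downtime, rate)
            pending_work = 0.0
            if checkpointed[i]:
                recovery = rec_costs[i]
    return total


if __name__ == "__main__":
    weights = [100.0, 100.0]
    costs = [1.0, 1.0]
    no_ckpt = expected_chain_time(weights, costs, costs, [False, False],
                                  0.0, 0.01)
    one_ckpt = expected_chain_time(weights, costs, costs, [True, False],
                                   0.0, 0.01)
    print(f"no checkpoint: {no_ckpt:.1f}, checkpoint after task 0: {one_ckpt:.1f}")
```

With these illustrative numbers, checkpointing the first task shortens the expected makespan because a failure in the second task no longer forces the first one to be re-executed. For very small failure rates the gain vanishes and the checkpoint cost dominates, which is the trade-off the scheduling problem has to resolve.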

HAL-ENS-LYON

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot

Real-time and dynamic fault-tolerant scheduling for scientific workflows in clouds

Author: Chang Victor
Ge Jidong
Hu Haiyang
Hu Hua
Li Chuanyi
Li Zhongjin
Publication venue: 'Elsevier BV'
Publication date: 01/08/2021

Cloud computing has become a popular technology for executing scientific workflows. However, with a large number of hosts and virtual machines (VMs) being deployed, cloud resource failures, such as the permanent failure of hosts (HPF), the transient failure of hosts (HTF), and the transient failure of VMs (VMTF), raise a service reliability problem. Fault tolerance for time-consuming scientific workflows is therefore essential in the cloud. However, existing fault-tolerant (FT) approaches consider only one or two of the above failure types and neglect the others, especially HTF. This paper proposes a Real-time and dynamic Fault-tolerant Scheduling (ReadyFS) algorithm for scientific workflow execution in a cloud, which guarantees deadline constraints and improves resource utilization even in the presence of any resource failure. Specifically, we first introduce two FT mechanisms, i.e., replication with delay execution (RDE) and checkpointing with delay execution (CDE), to cope with HPF and VMTF simultaneously. Additionally, rescheduling (ReSC) is devised to tackle HTF, which affects the resource availability of the entire cloud datacenter. Then, the resource adjustment (RA) strategy, including resource scaling-up (RS-Up) and resource scaling-down (RS-Down), is used to adjust resource demands and improve resource utilization dynamically. Finally, the ReadyFS algorithm is presented to schedule real-time scientific workflows by combining all the above FT mechanisms with the RA strategy. We conduct the performance evaluation with real-world scientific workflows and compare ReadyFS with five vertical comparison algorithms and three horizontal comparison algorithms. Simulation results confirm that ReadyFS is able to guarantee the fault tolerance of scientific workflow execution and improve cloud resource utilization.

Teesside University's Research Repository

Aston Publications Explorer

Checkpointing Strategies for Scheduling Computational Workflows

Author
Publication venue: 'IJNC Editorial Committee'
Publication date: 01/01/2016

Crossref