An enhanced ant colony system algorithm for dynamic fault tolerance in grid computing

Abstract

Fault tolerance in grid computing allows the system to continue operating despite the occurrence of failures. Most fault tolerance algorithms focus on fault handling techniques such as task reprocessing, checkpointing, task replication, penalty, and task migration. Ant colony system (ACS), a variant of ant colony optimization (ACO), is one of the promising algorithms for fault tolerance because of its ability to adapt to both static and dynamic combinatorial optimization problems. However, the ACS algorithm does not consider resource fitness during task scheduling, which leads to poor load balancing and a lower execution success rate. This research proposes dynamic ACS fault tolerance with suspension (DAFTS) for grid computing, which focuses on providing effective fault tolerance techniques to improve the execution success rate and load balancing. The proposed algorithm consists of a dynamic evaporation rate, a resource fitness-based scheduling process, an enhanced pheromone update with trust factor and suspension, and checkpoint-based task reprocessing. The research framework consists of four phases: identifying fault tolerance techniques, enhancing resource assignment and job scheduling, improving the fault tolerance algorithm, and evaluating the performance of the proposed algorithm. The proposed algorithm was developed in a simulated grid environment called GridSim and evaluated against other fault tolerance algorithms such as trust-based ACO, fault tolerance ACO, ACO without fault tolerance, and ACO with fault tolerance in terms of total execution time, average latency, average makespan, throughput, execution success rate, and load balancing. Experimental results showed that the proposed algorithm achieved the best performance in most aspects and the second best in terms of load balancing.
DAFTS achieved the smallest increases in execution time, average makespan, and average latency, of 7%, 11%, and 5% respectively, and the smallest decreases in throughput and execution success rate, of 6.49% and 9% respectively, as the failure rate increased. DAFTS also achieved the smallest increases in execution time, average makespan, and average latency, of 5.8, 8.5, and 8.7 times respectively, and the highest increase in throughput and the highest execution success rate, of 72.9% and 93.7% respectively, as the number of jobs increased. The proposed algorithm can effectively overcome load balancing problems and increase execution success rates in distributed systems that are prone to faults.
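
For readers unfamiliar with ACS, the following is a minimal sketch of the standard ACS rules that the proposed algorithm builds on: the pseudo-random proportional selection rule and the local and global pheromone updates. It is not the DAFTS implementation. The abstract does not specify the dynamic evaporation rate, the trust factor, or the suspension mechanism, so none of them is modeled here. The function names, the parameter defaults (`q0`, `beta`, `xi`, `rho`), and the use of a per-resource `fitness` value as the heuristic term are illustrative assumptions.

```python
import random


def choose_resource(pheromone, fitness, q0=0.9, beta=2.0, rng=random):
    """Standard ACS pseudo-random proportional rule over candidate resources.

    pheromone and fitness are equal-length lists, one entry per resource.
    Using fitness as the heuristic term is an assumption of this sketch.
    """
    scores = [tau * (eta ** beta) for tau, eta in zip(pheromone, fitness)]
    if rng.random() < q0:
        # Exploitation: pick the resource with the highest score.
        return max(range(len(scores)), key=scores.__getitem__)
    # Biased exploration: sample a resource in proportion to its score.
    return rng.choices(range(len(scores)), weights=scores, k=1)[0]


def local_update(tau, tau0, xi=0.1):
    """Standard ACS local update: pull tau toward the initial level tau0."""
    return (1.0 - xi) * tau + xi * tau0


def global_update(tau, delta, rho=0.1):
    """Standard ACS global update with a fixed evaporation rate rho."""
    return (1.0 - rho) * tau + rho * delta
```

In this sketch `rho` is a constant. A dynamic evaporation rate, as named in the abstract, would presumably make it vary during the run, but how it varies is not stated here.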
