Programmer-directed partial redundancy for resilient HPC

Abstract

In this work we propose partial task replication and check-pointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all application tasks can be prohibitive due to resource costs, we introduce programmer-directed selective replication mechanism to provide fault-tolerance while decreasing costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.Peer ReviewedPostprint (published version

Similar works

Full text

thumbnail-image

UPCommons. Portal del coneixement obert de la UPC

redirect
Last time updated on 01/05/2017

Having an issue?

Is data on this page outdated, violates copyrights or anything else? Report the problem now and we will take corresponding actions after reviewing your request.