conference paper
Programmer-directed partial redundancy for resilient HPC
Abstract
In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. Because complete replication of all application tasks can be prohibitively expensive in resources, we introduce a programmer-directed selective replication mechanism that provides fault tolerance at lower cost. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.
Peer Reviewed
Postprint (published version)
- Conference report
- UPC subject areas::Computer science::Computer architecture
- Computer architecture
- Fault-tolerant computing
- Computer programming
- Computer science
- Application tasks
- Check pointing
- Resource costs
- Selective replication
- Silent data corruption (SDC)
- Task parallel
- Task replications
- Fault tolerance (computer science)