Havens: Explicit Reliable Memory Regions for HPC Applications
Supporting error resilience in future exascale-class supercomputing systems
is a critical challenge. Due to transistor scaling trends and increasing memory
density, scientific simulations are expected to experience more interruptions
caused by transient errors in the system memory. Existing hardware-based
detection and recovery techniques will be inadequate to manage the presence of
high memory fault rates.
In this paper we propose a partial memory protection scheme based on
region-based memory management. We define the concept of regions called havens
that provide fault protection for program objects. We provide reliability for
the regions through a software-based parity protection mechanism. Our approach
enables critical program objects to be placed in these havens. The fault
coverage provided by our approach is application agnostic, unlike
algorithm-based fault tolerance techniques.
Comment: 2016 IEEE High Performance Extreme Computing Conference (HPEC '16),
September 2016, Waltham, MA, US
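The idea of a software parity-protected region can be illustrated with a minimal sketch. The abstract does not describe the actual havens interface or parity layout, so the class name `Haven`, the 8-byte word size, and the single XOR parity word below are all assumptions made for illustration only.

```python
class Haven:
    """Hypothetical memory region guarded by one XOR parity word (illustrative only)."""

    WORD = 8  # bytes per word; an assumed value, not taken from the paper

    def __init__(self, size_words):
        self.data = bytearray(size_words * self.WORD)
        self.parity = 0  # XOR of all words currently stored in the region

    def _word(self, index):
        start = index * self.WORD
        return int.from_bytes(self.data[start:start + self.WORD], "little")

    def write(self, index, value):
        # Update the parity incrementally: XOR out the old word, XOR in the new one.
        self.parity ^= self._word(index) ^ value
        start = index * self.WORD
        self.data[start:start + self.WORD] = value.to_bytes(self.WORD, "little")

    def check(self):
        # Recompute the parity over the whole region and compare it
        # with the stored signature; a mismatch signals corruption.
        current = 0
        for i in range(len(self.data) // self.WORD):
            current ^= self._word(i)
        return current == self.parity


haven = Haven(4)
haven.write(0, 42)
haven.write(1, 7)
ok_before = haven.check()   # region is consistent after normal writes
haven.data[3] ^= 0x10       # simulate a transient single-bit flip
ok_after = haven.check()    # the parity mismatch exposes the flip
```

A single parity word of this kind only detects corruption. Recovering the damaged word would require extra information, such as knowing which word is faulty, which this sketch does not attempt.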
Memory Vulnerability: A Case for Delaying Error Reporting
To face future reliability challenges, it is necessary to quantify the risk
of error in any part of a computing system. To this end, the Architectural
Vulnerability Factor (AVF) has long been used for chips. However, this metric
is used for offline characterisation, which is inappropriate for memory. We
survey the literature and formalise one of the metrics used, the Memory
Vulnerability Factor, and extend it to take into account false errors. These
are reported errors which would have no impact on the program if they were
ignored. We measure the False Error Aware MVF (FEA) and related metrics
precisely in a cycle-accurate simulator, and compare them with the effects of
injecting faults in a program's data, in native parallel runs. Our findings
show that MVF and FEA are the only two metrics that are safe to use at runtime,
as they both consistently give an upper bound on the probability of incorrect
program outcome. FEA gives a tighter bound than MVF, and is the metric that
correlates best with the incorrect outcome probability of all considered
metrics.
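The notion of a false error can be made concrete with a toy model. The abstract does not give the formal definition of MVF or FEA, so the sketch below assumes only one simple case: a fault in a memory word is harmless if the word is overwritten, or never accessed again, before the corrupted value is read. The function name `is_false_error` and the trace format are hypothetical.

```python
def is_false_error(accesses, fault_time):
    """Classify a fault in one memory word under a toy model.

    accesses: list of (time, op) pairs for that word, sorted by time,
              with op either "R" (read) or "W" (write).
    Returns True if the corrupted value is never consumed by a read.
    """
    for time, op in accesses:
        if time <= fault_time:
            continue
        # The first access after the fault decides the outcome:
        # a write discards the corrupted value, a read consumes it.
        return op == "W"
    # No later access at all: the corrupted value is never used.
    return True


overwritten = is_false_error([(1, "W"), (5, "W"), (6, "R")], fault_time=3)
consumed = is_false_error([(1, "W"), (5, "R")], fault_time=3)
```

This model is conservative: a corrupted value that is read may still be masked by later computation, a case the sketch treats as a real error.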