3 research outputs found

    On the resilience of parallel sparse hybrid solvers

    As the computational power of high performance computing (HPC) systems continues to increase through the use of a huge number of CPU cores or specialized processing units, extreme-scale applications are increasingly prone to faults. Consequently, the HPC community has proposed many contributions to the design of resilient HPC applications; these contributions may be system-oriented, theoretical, or numerical. In this study we consider an actual fully-featured parallel sparse hybrid (direct/iterative) linear solver, MAPHYS, and we propose numerical remedies to design a resilient version of the solver. The solver being hybrid, we focus in this study on the iterative solution step, which is often the dominant step in practice. We furthermore assume that a separate mechanism ensures fault detection and that a system layer provides support for setting the environment (processes, etc.) back into a running state. The present manuscript therefore focuses exclusively on strategies for recovering lost data once the fault has been detected and the system restored; fault detection and system recovery are separate concerns beyond the scope of this study. The numerical remedies we propose are twofold. Whenever possible, we exploit the natural data redundancy between processes of the solver to perform exact recovery through clever copies across processes. Otherwise, data that has been lost and is no longer available on any process is recovered through a so-called interpolation-restart mechanism. This mechanism is derived from [1] by carefully taking into account the properties of the target hybrid solver. These numerical remedies have been implemented in the MAPHYS parallel solver so that we can assess their efficiency on a large number of processing units (up to 12,288 CPU cores) for solving large-scale real-life problems.
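The interpolation-restart mechanism referenced above can be pictured with a small sketch. The following is a minimal, illustrative version of one common interpolation variant: the lost block of the current iterate is regenerated by solving the rows of the linear system owned by the failed process, using the entries that survived elsewhere. The dense NumPy setting and the function name are assumptions made for illustration; the actual MAPHYS implementation works on distributed sparse data and is more involved.

```python
# Minimal sketch of "linear interpolation" recovery for a lost block of the iterate.
import numpy as np

def interpolate_lost_block(A, b, x, lost, alive):
    """Regenerate x[lost] by solving the rows of A x = b owned by the failed process:
    A[lost, lost] @ x_lost = b[lost] - A[lost, alive] @ x[alive]."""
    rhs = b[lost] - A[np.ix_(lost, alive)] @ x[alive]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

# Toy usage: pretend entries 2..4 of the iterate were lost mid-solve.
rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)) + n * np.eye(n)   # random matrix shifted to be well conditioned
x_exact = rng.standard_normal(n)
b = A @ x_exact
x = x_exact + 1e-3 * rng.standard_normal(n)       # current iterate, close to the solution
lost = np.arange(2, 5)
alive = np.setdiff1d(np.arange(n), lost)
x[lost] = 0.0                                     # data lost on the failed process
x_rec = interpolate_lost_block(A, b, x, lost, alive)
# The iterative method is then restarted from x_rec instead of from scratch.
```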

    Interpolation-restart strategies for resilient eigensolvers

    The solution of large eigenproblems is involved in many scientific and engineering applications, for instance when stability analysis is a concern. For large simulations in material physics or thermo-acoustics, the calculation can last for many hours on large parallel platforms. On future large-scale systems, the mean time between failures (MTBF) of the system is expected to decrease, so that many faults could occur during the solution of large eigenproblems. Consequently, it becomes critical to design parallel eigensolvers that can survive faults. In that framework, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques for real large-scale parallel implementations. Because we focus on numerical remedies, we do not consider parallel implementations nor parallel experiments, but only numerical experiments. We assume that a separate mechanism ensures fault detection and that a system layer provides support for setting the environment (processes, etc.) back into a running state. Once the system is in a running state after a fault, our main objective is to provide robust resilient schemes so that the eigensolver may keep converging in the presence of the fault without restarting the calculation from scratch. For this purpose, we extend the interpolation-restart (IR) strategies initially introduced for the solution of linear systems in a previous work to the solution of eigenproblems in this paper. For a given numerical scheme, the IR strategies consist of extracting relevant spectral information from the data available after a fault. After data extraction, a well-selected part of the missing data is regenerated through interpolation strategies to constitute a meaningful input with which to restart the numerical algorithm. One of the main features of this numerical remedy is that it does not require extra resources, i.e., computational units or computing time, when no fault occurs. In this paper, we revisit a few state-of-the-art methods for solving large sparse eigenvalue problems, namely the Arnoldi methods, subspace iteration methods, and the Jacobi-Davidson method, in the light of our IR strategies. For each considered eigensolver, we adapt the IR strategies to regenerate as much spectral information as possible. Through extensive numerical experiments, we study the respective robustness of the resulting resilient schemes with respect to the MTBF and to the amount of data loss, via qualitative and quantitative illustrations.

1. Introduction. The computation of eigenpairs (eigenvalues and eigenvectors) of large sparse matrices is involved in many scientific and engineering applications in which stability analysis is a concern. To name a few, it appears in structural dynamics, thermodynamics, thermo-acoustics, and quantum chemistry. With the permanent increase of the computational power of high performance computing (HPC) systems through the use of a larger and larger number of CPU cores or specialized processing units, HPC applications are increasingly prone to faults. To guarantee fault tolerance, two classes of strategies are required: one for fault detection and the other for fault correction. Faults such as computational node crashes are obvious to detect, while silent faults may be challenging to detect.
To cope with silent faults, a duplication strategy is commonly used for fault detection [18, 39] by comparing the outputs, while triple modular redundancy (TMR) is used for fault detection and correction [34, 37]. However, the additional computational resources required by such replication strategies may represent a severe penalty. Instead of replicating computational resources, studies [7, 36] propose a time redundancy model for fault detection, which consists in repeating the computation twice on the same resource. The advantage of time redundancy models is their flexibility at the application level; software developers can select only a set of critical instructions to protect, and recomputing only some instructions instead of the whole application lowers the time redundancy overhead [25]. In some numerical simulations, data naturally satisfy well-defined mathematical properties, which can be efficiently exploited for fault detection through a periodical check of the numerical properties during computation [10]. Checkpoint/restart is the most studied fault recovery strategy in the context of HPC systems. The common checkpoint/restart scheme consists in periodically saving data onto a reliable storage device such as a remote disk. When a fault occurs, a rollback is performed to the point of the most recent and consistent checkpoint. According to the implemented checkpoint strategy, all processes ...
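As a rough illustration of the interpolation-restart idea described in the abstract above, the sketch below regenerates the lost entries of an approximate eigenvector from the relation (A - theta I) u ~ 0, where theta is the corresponding Ritz value, and then uses the interpolated vector to restart the eigensolver. The dense NumPy setting and the function name are illustrative assumptions, not the paper's implementation, which targets Arnoldi, subspace iteration, and Jacobi-Davidson on distributed sparse matrices.

```python
# Minimal sketch of interpolation-restart for an eigensolver: rebuild the lost
# block of a Ritz vector from the rows of (A - theta*I) u ~ 0 owned by the failed process.
import numpy as np

def interpolate_lost_eigvec(A, theta, u, lost, alive):
    """Regenerate u[lost] by solving (A[lost,lost] - theta*I) u_lost = -A[lost,alive] @ u[alive]."""
    M = A[np.ix_(lost, lost)] - theta * np.eye(len(lost))
    rhs = -A[np.ix_(lost, alive)] @ u[alive]
    u_new = u.copy()
    u_new[lost] = np.linalg.solve(M, rhs)
    return u_new / np.linalg.norm(u_new)

# Toy usage: take an (here exact) eigenpair, wipe a block of the vector, interpolate it back.
rng = np.random.default_rng(1)
n = 10
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                      # symmetric, for a clean toy example
w, V = np.linalg.eigh(A)
theta, u = w[-1], V[:, -1].copy()      # dominant eigenpair plays the role of a Ritz pair
lost = np.arange(3, 6)
alive = np.setdiff1d(np.arange(n), lost)
u[lost] = 0.0                          # entries lost in the fault
u_rec = interpolate_lost_eigvec(A, theta, u, lost, alive)
# u_rec then serves as the starting vector when the eigensolver is restarted.
```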

    Exploiting task-based programming models for resilience

    Hardware errors become more common as silicon technologies shrink and become more vulnerable, especially in memory cells, which are the most exposed to errors. Permanent and intermittent faults are caused by manufacturing variability and circuit ageing. While these can be mitigated once they are identified, their continuous rate of appearance throughout the lifetime of memory devices will always cause unexpected errors. In addition, transient faults are caused by effects such as radiation or small voltage/frequency margins, and there is no efficient way to shield against these events. Other constraints related to the diminishing size of transistors, such as power consumption and memory latency, have caused the microprocessor industry to turn to increasingly complex processor architectures. To address the difficulties of programming such architectures, programming models have emerged that rely on runtime systems. These systems form a new intermediate layer in the hardware-software abstraction stack that performs tasks such as distributing work across computing resources: processor cores, accelerators, etc. These runtime systems have access to a great deal of information, both from the hardware and from the applications, and thus offer many opportunities for optimisation. This thesis proposes solutions to the increasing fault rates in memory across multiple resilience disciplines, from algorithm-based fault tolerance to hardware error correcting codes, through OS reliability strategies. These solutions rely for their efficiency on the opportunities presented by runtime systems.

The first contribution of this thesis is an algorithm-based resilience technique that tolerates detected errors in memory. It recovers lost data by performing computations that rely on simple redundancy relations identified in the program. The recovery is demonstrated for a family of iterative solvers, the Krylov subspace methods, and evaluated for the conjugate gradient solver. The runtime can transparently overlap the recovery with the computations of the algorithm, which masks the already low overheads of this technique.

The second part of this thesis proposes a metric to characterise the impact of faults in memory, which outperforms state-of-the-art metrics in precision and in the guarantees it provides on the error rate. This metric reveals that some data is not relevant to the program, and we propose an OS-level strategy that ignores errors in such data by delaying the reporting of detected errors. This reduces the failure rates of running programs by ignoring errors that have no impact.

The architectural-level contribution of this thesis is a dynamically adaptable Error Correcting Code (ECC) scheme that can increase the protection of memory regions where the impact of errors is highest. A methodology is presented to estimate the fault rate at runtime using our metric, through the performance monitoring tools of current commodity processors. Guiding the dynamic ECC scheme online with the methodology's vulnerability estimates decreases the error rates of programs at a fraction of the redundancy cost required for a uniformly stronger ECC. This provides a useful and wide range of trade-offs between redundancy and error rates.

The work presented in this thesis demonstrates that runtime systems make it possible to exploit the redundancy stored in memory to help tackle increasing error rates in DRAM.
This exploited redundancy can be an inherent part of the algorithm that makes it possible to tolerate higher fault rates, or it can take the form of dead data stored in memory. Redundancy can also be added to a program, in the form of ECC. In all cases, the runtime decreases failure rates efficiently by diminishing recovery costs, identifying redundant data, or targeting critical data. It is thus a very valuable tool for future computing systems, as it can perform optimisations across different layers of abstraction.
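As a minimal sketch of the kind of redundancy relation the algorithm-based technique above relies on, the example below rebuilds a lost block of the conjugate gradient residual from the invariant r = b - A x, assuming x and b survive the error. The names and the dense NumPy setting are assumptions made for illustration; the thesis' task-based, overlapped implementation is not reproduced here.

```python
# Minimal sketch of exact forward recovery of a lost residual block in CG.
import numpy as np

def recover_lost_residual(A, b, x, r, lost):
    """Rebuild r[lost] from the invariant r = b - A @ x maintained by CG."""
    r_new = r.copy()
    r_new[lost] = b[lost] - A[lost, :] @ x
    return r_new

# Toy usage inside a CG-like iteration.
rng = np.random.default_rng(2)
n = 12
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite
x_exact = rng.standard_normal(n)
b = A @ x_exact
x = rng.standard_normal(n)             # some intermediate iterate
r = b - A @ x                          # residual maintained by the solver
lost = np.arange(4, 8)
r_damaged = r.copy()
r_damaged[lost] = np.nan               # detected uncorrectable error in this block
r_rec = recover_lost_residual(A, b, x, r_damaged, lost)
assert np.allclose(r_rec, r)           # exact forward recovery, no rollback needed
```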