3,313 research outputs found

Two-Level Checkpointing and Verifications for Linear Task Graphs

Author: Benoit Anne
Cavelan Aurélien
Robert Yves
Sun Hongyang
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 15/11/2015

Fail-stop and silent errors are omnipresent on large-scale platforms. Efficient resilience techniques must accommodate both error sources. To cope with the double challenge, a two-level checkpointing and rollback recovery approach can be used, with additional verifications for silent error detection. A fail-stop error leads to the loss of the whole memory content, hence the obligation to checkpoint on a stable storage (e.g., an external disk). On the contrary, it is possible to use in-memory checkpoints for silent errors, which provide a much smaller checkpointing and recovery overhead. Furthermore, recent detectors offer partial verification mechanisms that are less costly than the guaranteed ones but do not detect all silent errors. In this paper, we show how to combine all of these techniques for HPC applications whose dependency graph forms a linear chain. We present a sophisticated dynamic programming algorithm that returns the optimal solution in polynomial time. Simulation results demonstrate that the combined use of multi-level checkpointing and verifications leads to improved performance compared to the standard single-level checkpointing algorithm.

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

Fine-Grain Checkpointing with In-Cache-Line Logging

Author: Aksun David T.
Avni Hillel
Cohen Nachshon
Larus James R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 02/02/2019

Non-Volatile Memory offers the possibility of implementing high-performance, durable data structures. However, achieving performance comparable to well-designed data structures in non-persistent (transient) memory is difficult, primarily because of the cost of ensuring the order in which memory writes reach NVM. Often, this requires flushing data to NVM and waiting a full memory round-trip time. In this paper, we introduce two new techniques: Fine-Grained Checkpointing, which ensures a consistent, quickly recoverable data structure in NVM after a system failure, and In-Cache-Line Logging, an undo-logging technique that enables recovery of earlier state without requiring cache-line flushes in the normal case. We implemented these techniques in the Masstree data structure, making it persistent and demonstrating the ease of applying them to a highly optimized system and their low (5.9-15.4%) runtime overhead cost.

Comment: In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS 19), April 13, 2019, Providence, RI, US

arXiv.org e-Print Archive

Crossref

Optimizing memory management for optimistic simulation with reinforcement learning

Author: PELLEGRINI ALESSANDRO
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016

Simulation is a powerful technique to explore complex scenarios and analyze systems related to a wide range of disciplines. To allow for an efficient exploitation of the available computing power, speculative Time Warp-based Parallel Discrete Event Simulation is universally recognized as a viable solution. In this context, the rollback operation is a fundamental building block to support a correct execution even when causality inconsistencies materialize a posteriori. If this operation is supported via checkpoint/restore strategies, memory management plays a fundamental role in ensuring high performance of the simulation run. With few exceptions, adaptive protocols targeting memory management for Time Warp-based simulations have been mostly based on pre-defined analytic models of the system, expressed as closed-form functions that map the system's state to control parameters. The underlying assumption is that the model itself is optimal. In this paper, we present an approach that exploits reinforcement learning techniques. Rather than assuming an optimal control strategy, we seek to find the optimal strategy through parameter exploration. A value function that captures the history of system feedback is used, and no a priori knowledge of the system is required. An experimental assessment of the viability of our proposal is also provided for a mobile cellular system simulation.

ART

Archivio della ricerca- Università di Roma La Sapienza

Algorithmic Based Fault Tolerance Applied to High Performance Computing

Author: Bosilca George
Delmas Remi
Dongarra Jack
Langou Julien
Publication date: 01/01/2008

We present a new approach to fault tolerance for High Performance Computing systems. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the needs of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flips) on the fly during a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result even though one process failure occurred. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
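
For readers unfamiliar with the checksum idea behind Algorithmic Based Fault Tolerance, the following is a minimal sequential sketch in Python. It is not the authors' parallel subroutine: it only illustrates the classic full-checksum scheme for a matrix product, in which a column-checksum row is appended to A and a row-checksum column to B, so that a single corrupted entry of the product can be located and repaired. All function names are hypothetical, and exact integer arithmetic is assumed; floating-point data would need a tolerance in the comparisons.

```python
# Illustrative sketch of checksum-based ABFT for C = A x B.
# Function names are hypothetical; integer entries are assumed so that
# checksum comparisons are exact.

def matmul(a, b):
    """Plain triple-loop matrix product on lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b):
    """Return the full-checksum product: C plus a checksum row and column."""
    return matmul(with_column_checksum(a), with_row_checksum(b))

def locate_and_correct(cf):
    """Detect and repair one corrupted data entry of cf, in place.

    Returns the (row, column) that was repaired, or None if every
    checksum is consistent. Other error patterns raise ValueError.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1  # size of the data part
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern not correctable by this sketch")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the entry from its row checksum and the other row entries.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)
```

In this sketch, the failing row checksum and the failing column checksum intersect at the corrupted entry, which is then rebuilt from the row checksum. Corruption of a checksum entry itself, or of several entries at once, is deliberately left unhandled.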

arXiv.org e-Print Archive

CiteSeerX

MIMS EPrints

The University of Manchester - Institutional Repository

NSL-BLRL: Efficient Cache Warmup for Sampled Processor Simulation

Author: De Bosschere Koen
Eeckhout Lieven
Hellebaut Filip
Van Ertvelde Luk
Publication venue: IEEE Computer Society
Publication date: 01/01/2006

Ghent University Academic Bibliography

Kilo-instruction processors: overcoming the memory wall

Author: Cazorla Francisco
Cristal Kestelman Adrián
Galluzzi Marco
Pericas Miquel
Ramirez Garcia Tanausú
Santana Jaria Oliverio J.
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005

Historically, advances in integrated circuit technology have driven improvements in processor microarchitecture and led to today's microprocessors with sophisticated pipelines operating at very high clock frequencies. However, performance improvements achievable by high-frequency microprocessors have become seriously limited by main-memory access latencies because main-memory speeds have improved at a much slower pace than microprocessor speeds. It's crucial to deal with this performance disparity, commonly known as the memory wall, to enable future high-frequency microprocessors to achieve their performance potential. To overcome the memory wall, we propose kilo-instruction processors: superscalar processors that can maintain a thousand or more simultaneous in-flight instructions. Doing so means designing key hardware structures so that the processor can satisfy the high resource requirements without significantly decreasing processor efficiency or increasing energy consumption.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Sensornet checkpointing: enabling repeatability in testbeds and realism in simulations

Author: Dunkels Adam
Eriksson Joakim
Finne Niclas
Tsiftes Nicolas
Voigt Thiemo
Österlind Fredrik
Publication date: 01/01/2009

When developing sensor network applications, the shift from simulation to testbed causes application failures, resulting in additional time-consuming iterations between simulation and testbed. We propose transferring sensor network checkpoints between simulation and testbed to reduce the gap between simulation and testbed. Sensornet checkpointing combines the best of both simulation and testbeds: the nonintrusiveness and repeatability of simulation, and the realism of testbeds.

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive