1,546 research outputs found

    Deterministic Computations on a PRAM with Static Processor and Memory Faults.

    Get PDF
    We consider Parallel Random Access Machine (PRAM) which has some processors and memory cells faulty. The faults considered are static, i.e., once the machine starts to operate, the operational/faulty status of PRAM components does not change. We develop a deterministic simulation of a fully operational PRAM on a similar faulty machine which has constant fractions of faults among processors and memory cells. The simulating PRAM has nn processors and mm memory cells, and simulates a PRAM with nn processors and a constant fraction of mm memory cells. The simulation is in two phases: it starts with preprocessing, which is followed by the simulation proper performed in a step-by-step fashion. Preprocessing is performed in time O((mn+log⁥n)log⁥n)O((\frac{m}{n}+ \log n)\log n). The slowdown of a step-by-step part of the simulation is O(log⁥m)O(\log m)

    Quantum Monte Carlo for large chemical systems: Implementing efficient strategies for petascale platforms and beyond

    Full text link
    Various strategies to implement efficiently QMC simulations for large chemical systems are presented. These include: i.) the introduction of an efficient algorithm to calculate the computationally expensive Slater matrices. This novel scheme is based on the use of the highly localized character of atomic Gaussian basis functions (not the molecular orbitals as usually done), ii.) the possibility of keeping the memory footprint minimal, iii.) the important enhancement of single-core performance when efficient optimization tools are employed, and iv.) the definition of a universal, dynamic, fault-tolerant, and load-balanced computational framework adapted to all kinds of computational platforms (massively parallel machines, clusters, or distributed grids). These strategies have been implemented in the QMC=Chem code developed at Toulouse and illustrated with numerical applications on small peptides of increasing sizes (158, 434, 1056 and 1731 electrons). Using 10k-80k computing cores of the Curie machine (GENCI-TGCC-CEA, France) QMC=Chem has been shown to be capable of running at the petascale level, thus demonstrating that for this machine a large part of the peak performance can be achieved. Implementation of large-scale QMC simulations for future exascale platforms with a comparable level of efficiency is expected to be feasible

    Keeping checkpoint/restart viable for exascale systems

    Get PDF
    Next-generation exascale systems, those capable of performing a quintillion operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover of number of high-performance computing capability workloads, different failure distributions, hardware mean time to failures, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms
    • 

    corecore