    Toward a Performance/Resilience Tool for Hardware/Software Co-design of High-Performance Computing Systems

    Abstract: xSim is a simulation-based performance investigation toolkit that permits running high-performance computing (HPC) applications in a controlled environment with millions of concurrent execution threads, while observing application performance in a simulated extreme-scale system for hardware/software co-design. The presented work details newly developed features for xSim that permit the injection of MPI process failures, the propagation/detection/notification of such failures within the simulation, and their handling using application-level checkpoint/restart. These new capabilities enable the observation of application behavior and performance under failure within a simulated future-generation HPC system using the most common fault handling technique.