1,397 research outputs found
Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension
As the scale of High-performance Computing (HPC) systems continues to grow, researchers are devoted themselves to achieve the best performance of running long computing jobs on these systems. My research focus on reliability and efficiency study for HPC software.
First, as systems become larger, mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research aims to present a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. Using multiple overlapping topologies to optimize the detection and propagation, minimizing the incurred overhead sand guaranteeing the scalability of the entire framework. Results from different machines and benchmarks compared to related works shows that my design and implementation outperforms non-HPC solutions significantly, and is competitive with specialized HPC solutions that can manage only MPI applications.
Second, I endeavor to implore instruction level parallelization to achieve optimal performance. Novel processors support long vector extensions, which enables researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extension (AVX512 and AVX2) instructions for x86 Instruction Set Architecture (ISA). Arm introduced Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelisms. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. Also, I use gather and scatter feature to speed up the packing and unpacking operation in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architecture and efficient
A runtime heuristic to selectively replicate tasks for application-specific reliability targets
In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification/recompilation of OS, compiler or application code. Our heuristic, we call App_FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.This work was supported by FI-DGR 2013 scholarship and the European Community’s
Seventh Framework Programme [FP7/2007-2013] under the Mont-blanc 2
Project (www.montblanc-project.eu), grant agreement no. 610402 and in part by the
European Union (FEDER funds) under contract TIN2015-65316-P.Peer ReviewedPostprint (author's final draft
Recommended from our members
Epidemic failure detection and consensus for extreme parallelism
Future extreme-scale high-performance computing systems will be required
to work under frequent component failures. The MPI Forum’s User
Level Failure Mitigation proposal has introduced an operation,
MPI Comm shrink, to synchronize the alive processes on the list of failed
processes, so that applications can continue to execute even in the presence
of failures by adopting algorithm-based fault tolerance techniques. This
MPI Comm shrink operation requires a failure detection and consensus
algorithm. This paper presents three novel failure detection and consensus
algorithms using Gossiping. Stochastic pinging is used to quickly detect
failures during the execution of the algorithm, failures are then disseminated
to all the fault-free processes in the system and consensus on the
failures is detected using the three consensus techniques. The proposed
algorithms were implemented and tested using the Extreme-scale Simulator.
The results show that the stochastic pinging detects all the failures in
the system. In all the algorithms, the number of Gossip cycles to achieve
global consensus scales logarithmically with system size. The second algorithm
also shows better scalability in terms of memory and network
bandwidth usage and a perfect synchronization in achieving global consensus.
The third approach is a three-phase distributed failure detection
and consensus algorithm and provides consistency guarantees even in very
large and extreme-scale systems while at the same time being memory and
bandwidth efficient
A Pattern Language for High-Performance Computing Resilience
High-performance computing systems (HPC) provide powerful capabilities for
modeling, simulation, and data analytics for a broad class of computational
problems. They enable extreme performance of the order of quadrillion
floating-point arithmetic calculations per second by aggregating the power of
millions of compute, memory, networking and storage components. With the
rapidly growing scale and complexity of HPC systems for achieving even greater
performance, ensuring their reliable operation in the face of system
degradations and failures is a critical challenge. System fault events often
lead the scientific applications to produce incorrect results, or may even
cause their untimely termination. The sheer number of components in modern
extreme-scale HPC systems and the complex interactions and dependencies among
the hardware and software components, the applications, and the physical
environment makes the design of practical solutions that support fault
resilience a complex undertaking. To manage this complexity, we developed a
methodology for designing HPC resilience solutions using design patterns. We
codified the well-known techniques for handling faults, errors and failures
that have been devised, applied and improved upon over the past three decades
in the form of design patterns. In this paper, we present a pattern language to
enable a structured approach to the development of HPC resilience solutions.
The pattern language reveals the relations among the resilience patterns and
provides the means to explore alternative techniques for handling a specific
fault model that may have different efficiency and complexity characteristics.
Using the pattern language enables the design and implementation of
comprehensive resilience solutions as a set of interconnected resilience
patterns that can be instantiated across layers of the system stack.Comment: Proceedings of the 22nd European Conference on Pattern Languages of
Program
Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery
Efficient utilization of today's high-performance computing (HPC) systems
with complex hardware and software components requires that the HPC
applications are designed to tolerate process failures at runtime. With low
mean time to failure (MTTF) of current and future HPC systems, long running
simulations on these systems require capabilities for gracefully handling
process failures by the applications themselves. In this paper, we explore the
use of fault tolerance extensions to Message Passing Interface (MPI) called
user-level failure mitigation (ULFM) for handling process failures without the
need to discard the progress made by the application. We explore two
alternative recovery strategies, which use ULFM along with application-driven
in-memory checkpointing. In the first case, the application is recovered with
only the surviving processes, and in the second case, spares are used to
replace the failed processes, such that the original configuration of the
application is restored. Our experimental results demonstrate that graceful
degradation is a viable alternative for recovery in environments where spares
may not be available.Comment: 26th Euromicro International Conference on Parallel, Distributed and
network-based Processing (PDP 2018
Fault tolerance of MPI applications in exascale systems: The ULFM solution
[Abstract]
The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future exascale systems, not only to ensure the completion of their execution in these systems but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now, it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the necessary flexibility to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which allows reducing the overhead and obtaining the required level of efficiency needed in the future exascale platforms.Ministerio de EconomĂa y Competitividad and FEDER; TIN2016-75845-PXunta de Galicia; ED431C 2017/04National Science Foundation of the United States; NSF-SI2 #1664142Exascale Computing Project; 17-SC-20-SCHoneywell International, Inc.; DE-NA000352
What does fault tolerant Deep Learning need from MPI?
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML)
algorithm for large scale data analysis. DL algorithms are computationally
expensive - even distributed DL implementations which use MPI require days of
training (model learning) time on commonly studied datasets. Long running DL
applications become susceptible to faults - requiring development of a fault
tolerant system infrastructure, in addition to fault tolerant DL algorithms.
This raises an important question: What is needed from MPI for de- signing
fault tolerant DL implementations? In this paper, we address this problem for
permanent faults. We motivate the need for a fault tolerant MPI specification
by an in-depth consideration of recent innovations in DL algorithms and their
properties, which drive the need for specific fault tolerance features. We
present an in-depth discussion on the suitability of different parallelism
types (model, data and hybrid); a need (or lack thereof) for check-pointing of
any critical data structures; and most importantly, consideration for several
fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI
and their applicability to fault tolerant DL implementations. We leverage a
distributed memory implementation of Caffe, currently available under the
Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches
by ex- tending MaTEx-Caffe for using ULFM-based implementation. Our evaluation
using the ImageNet dataset and AlexNet, and GoogLeNet neural network topologies
demonstrates the effectiveness of the proposed fault tolerant DL implementation
using OpenMPI based ULFM
- …