18 research outputs found
Fault tolerance of MPI applications in exascale systems: The ULFM solution
The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing on future exascale systems, not only to ensure the completion of their execution but also to improve their energy consumption. Although the Message Passing Interface (MPI) is the most popular programming model for distributed-memory HPC systems, as of now it does not provide any fault-tolerant construct for users to handle failures. Thus, the recovery procedure is postponed until the application is aborted and re-spawned. The proposal of the User Level Failure Mitigation (ULFM) interface in the MPI Forum provides new opportunities in this field, enabling the implementation of resilient MPI applications, system runtimes, and programming language constructs able to detect and react to failures without aborting their execution. This paper presents a global overview of the resilience interfaces provided by the ULFM specification, covers archetypal usage patterns and building blocks, and surveys the wide variety of application-driven solutions that have exploited them in recent years. The large and varied number of approaches in the literature proves that ULFM provides the flexibility necessary to implement efficient fault-tolerant MPI applications. All the proposed solutions are based on application-driven recovery mechanisms, which reduce overhead and deliver the level of efficiency required on future exascale platforms.
Funding: Ministerio de Economía y Competitividad and FEDER; TIN2016-75845-P. Xunta de Galicia; ED431C 2017/04. National Science Foundation of the United States; NSF-SI2 #1664142. Exascale Computing Project; 17-SC-20-SC. Honeywell International, Inc.; DE-NA000352
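The application-driven recovery cycle that these ULFM-based solutions share (detect a failure, evict failed processes, restore state, resume) can be sketched without a real MPI installation. The following is a minimal pure-Python simulation of that pattern; every name in it (`ProcFailed`, `run_phase`, `resilient_execute`) is hypothetical, standing in for ULFM's MPIX_Comm_revoke/MPIX_Comm_shrink/MPIX_Comm_agree calls and an application-level checkpoint.

```python
# Minimal simulation of the ULFM-style recovery pattern: run a phase,
# detect a process failure, "shrink" the group, roll back to a
# checkpoint, and retry.  All names here are hypothetical stand-ins,
# not the real ULFM API.

class ProcFailed(Exception):
    """Stands in for an MPI_ERR_PROC_FAILED error code."""

def run_phase(ranks, data, failing):
    # Any still-present failed rank aborts the phase, as a revoked
    # communicator would.
    if failing & set(ranks):
        raise ProcFailed
    return sum(data[r] for r in ranks)  # stand-in for the phase's reduction

def resilient_execute(nprocs, data, failing):
    ranks = list(range(nprocs))
    checkpoint = dict(data)             # application-level checkpoint
    while True:
        try:
            return run_phase(ranks, data, failing), ranks
        except ProcFailed:
            # "shrink": evict failed ranks, then restore state and retry
            ranks = [r for r in ranks if r not in failing]
            data = dict(checkpoint)

result, survivors = resilient_execute(4, {0: 1, 1: 2, 2: 3, 3: 4}, failing={2})
print(result, survivors)  # reduction over the surviving ranks only
```

The key property the sketch mirrors is that recovery is driven by the application: the failed rank's contribution is simply dropped and the phase is re-run from the checkpoint, instead of the whole job being aborted and re-spawned.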
Hybrid time-dependent Ginzburg-Landau simulations of block copolymer nanocomposites: nanoparticle anisotropy
Block copolymer melts are perfect candidates to template the position of colloidal nanoparticles at the nanoscale, on top of their well-known suitability for lithography applications. This is due to their ability to self-assemble into periodic ordered structures, in which nanoparticles can segregate depending on the polymer-particle interactions, size and shape. The resulting coassembled structure can be highly ordered as a combination of both the polymeric and colloidal properties. The time-dependent Ginzburg-Landau model for the block copolymer was combined with Brownian dynamics for the nanoparticles, resulting in an efficient mesoscopic model to study the complex behaviour of block copolymer nanocomposites. This review covers recent developments of the time-dependent Ginzburg-Landau/Brownian dynamics scheme, including efforts to parallelise the numerical scheme and applications of the model. The validity of the model is assessed by comparing simulation and experimental results for isotropic nanoparticles. Extensions to simulate nonspherical and inhomogeneous nanoparticles are discussed together with simulation results. The time-dependent Ginzburg-Landau/Brownian dynamics scheme is shown to be a flexible method which can account for the relatively large system sizes required to study block copolymer nanocomposite systems, while being easily extensible to simulate nonspherical nanoparticles.
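The hybrid scheme alternates two updates per time step: a TDGL relaxation of the copolymer order parameter and a Brownian step for the particles. Below is a minimal sketch of that loop, assuming non-conserved (model A) TDGL dynamics on a periodic grid and a crude gradient-drift stand-in for the polymer-particle coupling; all parameter values are illustrative, not taken from the review.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- parameters (illustrative values only) ---
N, dt, M, kappa = 64, 0.01, 1.0, 1.0     # grid size, time step, mobility, gradient coeff.
D = 0.1                                   # particle diffusion constant
psi = 0.1 * rng.standard_normal((N, N))   # block copolymer order parameter
x = rng.uniform(0, N, size=(10, 2))       # 10 nanoparticle positions

def laplacian(f):
    """Five-point Laplacian with periodic boundaries (unit grid spacing)."""
    return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
            np.roll(f, 1, 1) + np.roll(f, -1, 1) - 4 * f)

for step in range(200):
    # TDGL (model A) relaxation of the copolymer field
    mu = psi**3 - psi - kappa * laplacian(psi)   # chemical potential
    psi -= dt * M * mu
    # Brownian dynamics for the particles: drift down the local field
    # gradient (a crude stand-in for the polymer-particle coupling)
    g0, g1 = np.gradient(psi)
    i0 = x[:, 0].astype(int) % N
    i1 = x[:, 1].astype(int) % N
    drift = np.stack([g0[i0, i1], g1[i0, i1]], axis=1)
    x = (x - dt * drift + np.sqrt(2 * D * dt) * rng.standard_normal(x.shape)) % N

print(np.isfinite(psi).all(), x.shape)
```

The review's actual model couples the particles back into the free energy of the field; here that feedback is omitted to keep the sketch short, which is why this is a structural illustration of the scheme rather than a faithful implementation.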
Implicit Actions and Non-blocking Failure Recovery with MPI
Scientific applications have long embraced MPI as the environment of choice for execution on large distributed systems. The User-Level Failure Mitigation (ULFM) specification extends the MPI standard to address resilience and enable MPI applications to restore their communication capability after a failure. This work builds upon the wide body of experience gained in the field to eliminate the gap between current practice and the ideal, more asynchronous, recovery model in which the fault-tolerance activities of multiple components can be carried out simultaneously and overlap. This work proposes to: (1) provide the required consistency in fault reporting to applications (i.e., enable an application to assess the success of a computational phase without incurring an unacceptable performance hit); (2) bring forward the building blocks that permit the effective scoping of fault recovery in an application, so that independent components in an application can recover without interfering with each other, and separate groups of processes in the application can recover independently or in unison; and (3) overlap recovery activities necessary to restore the consistency of the system (e.g., eviction of faulty processes from the communication group) with application recovery activities (e.g., dataset restoration from checkpoints).
Comment: Accepted in FTXS'22 https://sites.google.com/view/ftxs202
X10 for high-performance scientific computing
High performance computing is a key technology that enables large-scale physical
simulation in modern science. While great advances have been made in methods and
algorithms for scientific computing, the most commonly used programming models
encourage a fragmented view of computation that maps poorly to the underlying
computer architecture.
Scientific applications typically manifest physical locality, which means that interactions
between entities or events that are nearby in space or time are stronger
than more distant interactions. Linear-scaling methods exploit physical locality by approximating
distant interactions, to reduce computational complexity so that cost is
proportional to system size. In these methods, the computation required for each
portion of the system is different depending on that portion's contribution to the
overall result. To support productive development, application programmers need
programming models that cleanly map aspects of the physical system being simulated
to the underlying computer architecture while also supporting the irregular
workloads that arise from the fragmentation of a physical system.
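How linear scaling follows from physical locality can be shown with a simple cutoff scheme: if interactions beyond a cutoff are dropped (here simply dropped, rather than approximated as in real linear-scaling methods), a cell list reduces the pairwise cost from O(N²) to roughly O(N). The sketch below is illustrative and not taken from the thesis.

```python
import numpy as np

def neighbour_energy(pos, cutoff):
    """O(N) pairwise 1/r energy using a 1-D cell list: with cell width
    equal to the cutoff, any pair within range lies in the same or an
    adjacent cell, so distant pairs are never even visited."""
    cells = {}
    for i, p in enumerate(pos):
        cells.setdefault(int(p // cutoff), []).append(i)
    energy = 0.0
    for c, members in cells.items():
        # same cell plus the cell to the right; the left cell is covered
        # when that cell is itself processed (positions are sorted, so
        # the b > a test counts each pair exactly once)
        neighbours = members + cells.get(c + 1, [])
        for a in members:
            for b in neighbours:
                if b > a and abs(pos[a] - pos[b]) < cutoff:
                    energy += 1.0 / abs(pos[a] - pos[b])
    return energy

rng = np.random.default_rng(1)
pos = np.sort(rng.uniform(0.0, 100.0, 1000))
print(neighbour_energy(pos, cutoff=2.0))
```

Real linear-scaling methods such as the fast multipole method keep the distant interactions but approximate them hierarchically; the per-cell workload here also illustrates the irregularity the thesis refers to, since cell occupancies (and hence costs) vary across the system.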
X10 is a new programming language for high-performance computing that uses
the asynchronous partitioned global address space (APGAS) model, which combines
explicit representation of locality with asynchronous task parallelism. This thesis
argues that the X10 language is well suited to expressing the algorithmic properties
of locality and irregular parallelism that are common to many methods for physical
simulation.
The work reported in this thesis was part of a co-design effort involving researchers
at IBM and ANU in which two significant computational chemistry codes
were developed in X10, with an aim to improve the expressiveness and performance
of the language. The first is a Hartree-Fock electronic structure code, implemented
using the novel Resolution of the Coulomb Operator approach. The second evaluates
electrostatic interactions between point charges, using either the smooth particle
mesh Ewald method or the fast multipole method, with the latter used to simulate
ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer.
We compare the performance of both X10 applications to state-of-the-art software
packages written in other languages.
This thesis presents improvements to the X10 language and runtime libraries for
managing and visualizing the data locality of parallel tasks, communication using
active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples.
This work demonstrates that X10 can achieve performance comparable to established
programming languages when running on a single core. More importantly,
X10 programs can achieve high parallel efficiency on a multithreaded architecture,
given a divide-and-conquer pattern of parallel tasks and appropriate use of worker-local
data. For distributed memory architectures, X10 supports the use of active messages
to construct local, asynchronous communication patterns which outperform global,
synchronous patterns. Although point-to-point active messages may be implemented
efficiently, productive application development also requires collective communications;
more work is required to integrate both forms of communication in the X10
language. The exploitation of locality is the key insight in both linear-scaling methods and
the APGAS programming model; their combination represents an attractive opportunity
for future co-design efforts.
High performance computing applications: Inter-process communication, workflow optimization, and deep learning for computational nuclear physics
Various aspects of high performance computing (HPC) are addressed in this thesis. The main focus is on analyzing and suggesting novel ideas to improve an application's performance and scalability on HPC systems and to make the most of the available computational resources.
The choice of inter-process communication is one of the main factors that can influence an application's performance. This study investigates other computational paradigms, such as one-sided communication, that are known to improve the efficiency of current implementation methods. We compare the performance and scalability of the SHMEM and corresponding MPI-3 routines for five different benchmark tests using a Cray XC30. The performance of the MPI-3 get and put operations was evaluated using fence synchronization and also using lock-unlock synchronization. The five tests used communication patterns ranging from light to heavy data traffic: accessing distant messages, circular right shift, gather, broadcast and all-to-all. Each implementation was run using message sizes of 8 bytes, 10 Kbytes and 1 Mbyte and up to 768 processes. For nearly all tests, the SHMEM get and put implementations outperformed the MPI-3 get and put implementations. We noticed a significant performance increase using MPI-3 instead of MPI-2 when compared with performance results from previous studies. One can use this performance and scalability analysis to choose the implementation method best suited for a particular application to run on a specific HPC machine.
Today's HPC machines are complex and constantly evolving, making it important to be able to easily evaluate the performance and scalability of HPC applications on both existing and new HPC computers. The evaluation of the performance of applications can be time consuming and tedious. HPC-Bench is a general-purpose tool used to optimize the benchmarking workflow for HPC, aiding the efficient evaluation of performance using multiple applications on an HPC machine with only a click of a button. HPC-Bench allows multiple applications written in different languages, with multiple parallel versions, using multiple numbers of processes/threads, to be evaluated. Performance results are put into a database, which is then queried for the desired performance data, and the R statistical software package is used to generate the desired graphs and tables. The use of HPC-Bench is illustrated with complex applications that were run on the National Energy Research Scientific Computing Center's (NERSC) Edison Cray XC30 HPC computer.
With the advancement of HPC machines, one needs efficient algorithms and new tools to make the most of the available computational resources. This work also discusses a novel application of deep learning to a nuclear physics problem. In recent years, several successful applications of artificial neural networks (ANNs) have emerged in nuclear physics and high-energy physics, as well as in biology, chemistry, meteorology, and other fields of science. A major goal of nuclear theory is to predict nuclear structure and nuclear reactions from the underlying theory of the strong interactions, Quantum Chromodynamics (QCD). The nuclear quantum many-body problem is a computationally hard problem to solve. With access to powerful HPC systems, several ab initio approaches, such as the No-Core Shell Model (NCSM), have been developed for approximately solving finite nuclei with realistic strong interactions. However, to accurately solve for the properties of atomic nuclei, one faces immense theoretical and computational challenges. To obtain nuclear physics observables as close as possible to the exact results, one seeks NCSM solutions in the largest feasible basis spaces. These results, obtained in a finite basis, are then used to extrapolate to the infinite basis space limit and thus obtain results corresponding to the complete basis within evaluated uncertainties. Each observable requires a separate extrapolation, and most observables have no proven extrapolation method. We propose a feed-forward ANN method as an extrapolation tool to obtain the ground state energy and the ground state point-proton root-mean-square (rms) radius, along with their extrapolation uncertainties. The designed ANNs are sufficient to produce results for these two very different observables in ^6Li from ab initio NCSM results in small basis spaces that satisfy the following theoretical physics condition: independence of basis space parameters in the limit of extremely large matrices.
Comparisons of the ANN results with other extrapolation methods are also provided.
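The extrapolation idea can be sketched with a toy feed-forward network: fit synthetic "finite-basis" energies that approach an asymptote as the basis grows, then query the trained network beyond the data. The data, architecture, and training setup below are illustrative only, not the NCSM setup used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "finite-basis" results: energies approach an asymptote (-28 here)
# as the basis size N grows.  Values are synthetic, not NCSM data.
N = np.array([4.0, 6.0, 8.0, 10.0, 12.0, 14.0])
E = -28.0 + 5.0 * np.exp(-0.3 * N)

# Scale inputs and targets to O(1) for stable training.
x = (N / N.max()).reshape(-1, 1)
mu, sd = E.mean(), E.std()
y = ((E - mu) / sd).reshape(-1, 1)

# One-hidden-layer feed-forward network trained by full-batch
# gradient descent on squared error.
W1 = 0.5 * rng.standard_normal((1, 8)); b1 = np.zeros(8)
W2 = 0.5 * rng.standard_normal((8, 1)); b2 = np.zeros(1)
lr, losses = 0.1, []
for _ in range(5000):
    h = np.tanh(x @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    losses.append(float((err**2).mean()))
    # Backpropagation through the two layers
    dpred = 2 * err / len(x)
    dW2, db2 = h.T @ dpred, dpred.sum(0)
    dh = (dpred @ W2.T) * (1 - h**2)
    dW1, db1 = x.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Extrapolate to a basis size beyond the available data.
x_big = np.array([[20.0 / N.max()]])
E_extrap = float((np.tanh(x_big @ W1 + b1) @ W2 + b2) * sd + mu)
print(round(E_extrap, 2))
```

Because the hidden tanh units saturate, the network's prediction flattens beyond the training range, which is the qualitative behaviour wanted from a convergence extrapolation; the thesis's actual method additionally quantifies the extrapolation uncertainty, which this sketch does not attempt.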
Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management
Advancements in cutting-edge technologies have enabled better energy efficiency as well as scaling of computational power for the latest High Performance Computing (HPC) systems. However, complexity due to hybrid architectures, as well as emerging classes of applications, has led to poor computational scalability under conventional execution models. Thus, alternative means of computation that address these bottlenecks are warranted. More precisely, dynamic adaptive resource management, from both the system's and the application's perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions targeted at one or more synchronous domains, metadata for computation and resource management, as well as an infrastructure for dynamic policy deployment. In addition, the research presents guidelines for a resource-management framework in the HPX runtime system. Further, this research lists design principles for the scalability of the Active Global Address Space (AGAS), a necessary feature for Parallel Processes. Finally, to verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task scheduling policies is carried out using two applications: Unbalanced Tree Search, a reference dynamic graph application implemented in HPX by this research, and MiniGhost, a reference stencil-based application using the bulk synchronous parallel model.
The results show that different scheduling policies provide better performance for different classes of applications; for the same application class, one policy fared better than the others in certain instances and worse in others. This supports the hypothesis that a dynamic adaptive resource management infrastructure, able to deploy different policies and task granularities, is needed for scalable distributed computing.