Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort
MPI uses the concept of communicators to connect groups of processes. It
provides nonblocking collective operations on communicators to overlap
communication and computation. Flexible algorithms demand flexible
communicators. E.g., a process can work on different subproblems within
different process groups simultaneously, new process groups can be created, or
the members of a process group can change. Depending on the number of
communicators, the time for communicator creation can drastically increase the
running time of the algorithm. Furthermore, a new communicator synchronizes all
processes as communicator creation routines are blocking collective operations.
We present RBC, a communication library based on MPI, that creates
range-based communicators in constant time without communication. These RBC
communicators support (non)blocking point-to-point communication as well as
(non)blocking collective operations. Our experiments show that the library
reduces the time to create a new communicator by a factor of more than 400
whereas the running time of collective operations remains about the same. We
propose Janus Quicksort, a distributed sorting algorithm that avoids any load
imbalances. We improved the performance of this algorithm by a factor of 15 for
moderate inputs by using RBC communicators. Finally, we discuss different
approaches to bring nonblocking (local) communicator creation of lightweight
(range-based) communicators into MPI.
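
The idea of creating a range-based communicator in constant time without communication can be illustrated with a small sketch. This is not the RBC library's API: the class `RangeComm` and its methods are hypothetical names, and the sketch assumes a communicator is fully described by a contiguous range of ranks in a parent communicator, so that creating one only stores two integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeComm:
    """Hypothetical range-based communicator: parent ranks first..last."""
    first: int  # lowest parent rank in the range (inclusive)
    last: int   # highest parent rank in the range (inclusive)

    @property
    def size(self) -> int:
        return self.last - self.first + 1

    def to_parent_rank(self, local_rank: int) -> int:
        # Rank translation is a single addition, so it costs O(1).
        if not 0 <= local_rank < self.size:
            raise ValueError("local rank out of range")
        return self.first + local_rank

    def to_local_rank(self, parent_rank: int) -> int:
        if not self.first <= parent_rank <= self.last:
            raise ValueError("parent rank not in this communicator")
        return parent_rank - self.first

    def split(self, local_first: int, local_last: int) -> "RangeComm":
        # Purely local: no messages are exchanged and no process waits.
        if not 0 <= local_first <= local_last < self.size:
            raise ValueError("invalid sub-range")
        return RangeComm(self.first + local_first, self.first + local_last)


# Assumed setup: 8 processes, split into two halves as a sort might do.
world = RangeComm(0, 7)
left = world.split(0, 3)
right = world.split(4, 7)
```

Because every process can compute the same range from the same arguments, no agreement step is needed in this model. Actual message passing would still go through the parent MPI communicator after rank translation.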
Performance analysis and optimization of the FFTXlib on the Intel Knights Landing architecture
In this paper, we address the decreasing performance of the FFTXlib, the Fast Fourier Transformation (FFT) kernel of Quantum ESPRESSO, when scaling to a full KNL node. An increased performance in the FFTXlib will likewise increase the performance of the entire Quantum ESPRESSO code, one of the most used plane-wave DFT codes in the materials science community. Our approach focuses on, first, overlapping computation and communication and, second, decreasing resource contention for higher compute efficiency. To achieve this, we use the OmpSs programming model, which is based on task dependencies. We allow overlapping of computation and communication by converting all steps of the FFT into tasks following a flow dependency. In the same way, we decrease resource contention by converting each FFT into an individual task that can be scheduled asynchronously. In both cases, multiple FFTs can be computed in parallel. The task-based optimizations are implemented in the FFTXlib and show up to 10% runtime reduction on the already highly optimized version. Since the task scheduling is done dynamically during execution by the parallel runtime, not statically by the user, it also frees the user from finding the ideal parallel configuration manually.
We gratefully acknowledge the support of the MaX and POP projects, which have received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 676598 and 676553, respectively.
Peer reviewed. Postprint (author's final draft).
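
The second optimization described above, turning each FFT into an individual task that a runtime schedules asynchronously, can be sketched structurally as follows. This is an illustration only and does not use OmpSs or the FFTXlib: a thread pool from Python's standard library stands in for the task runtime, a naive DFT stands in for one FFT, and the names `dft` and `transform_all` are assumptions made for this example.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor


def dft(signal):
    """Naive O(n^2) discrete Fourier transform, standing in for one FFT."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def transform_all(signals, max_workers=4):
    """Submit one task per transform; the pool decides when each one runs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(dft, s) for s in signals]
        # Results are collected in submission order, not completion order.
        return [f.result() for f in futures]


spectra = transform_all([[1, 0, 0, 0], [1, 1, 1, 1]])
```

The caller only states what work exists and leaves placement to the scheduler, which mirrors the point that dynamic scheduling relieves the user of choosing a parallel configuration. Python threads will not speed up this arithmetic, so the sketch shows the task structure and not the performance effect.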
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if the memory were shared on a global level. Given such a global view
on memory, the user may program applications very much like shared memory
systems. This greatly simplifies the tasks of developing parallel applications,
because no explicit communication has to be specified in the program for data
exchange between different computing nodes. In this paper we present DART, a
runtime environment, which implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.
Comment: 11 pages, International Conference on Partitioned Global Address
Space Programming Models (PGAS14)
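
The global view on memory that PGAS provides can be illustrated with a toy model. This is not DART's interface: `GlobalArray`, `locate`, `put`, and `get` are hypothetical names, the "units" are plain Python lists inside one process, and a block distribution is assumed. In a real runtime, `put` and `get` would map to one-sided operations on remote memory.

```python
class GlobalArray:
    """Toy PGAS array, block-distributed over simulated units."""

    def __init__(self, length: int, num_units: int):
        if length <= 0 or num_units <= 0:
            raise ValueError("length and num_units must be positive")
        self.length = length
        self.block = -(-length // num_units)  # ceiling division
        # Each unit owns one contiguous block; trailing units may own less.
        self.segments = [
            [0] * max(0, min(self.block, length - u * self.block))
            for u in range(num_units)
        ]

    def locate(self, gidx: int):
        """Translate a global index into (owning unit, local offset)."""
        if not 0 <= gidx < self.length:
            raise IndexError("global index out of range")
        return divmod(gidx, self.block)

    def put(self, gidx: int, value) -> None:
        unit, offset = self.locate(gidx)
        self.segments[unit][offset] = value  # the owner takes no action

    def get(self, gidx: int):
        unit, offset = self.locate(gidx)
        return self.segments[unit][offset]


arr = GlobalArray(length=10, num_units=4)
arr.put(7, 42)
```

The program addresses element 7 without naming the unit that stores it; the translation to an owner and an offset happens inside the array, which is the simplification the abstract attributes to the PGAS approach.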
Failure Avoidance in MPI Applications Using an Application-Level Approach
Execution times of large-scale computational science and engineering parallel applications are usually longer than the mean-time-between-failures. For this reason, hardware failures must be tolerated by the applications to ensure that not all computation done is lost on machine failures. Checkpointing and rollback recovery is one of the most popular techniques to provide fault tolerance support to parallel applications. However, when a failure occurs, most checkpointing mechanisms require a complete restart of the parallel application from the last checkpoint. New advances in the prediction of hardware failures have led to the development of proactive process migration approaches, where tasks are migrated in a preventive way when node failures are anticipated, avoiding the restart of the whole application. The work presented in this paper extends an application-level checkpointing framework to proactively migrate message passing interface (MPI) processes when impending failures are announced, without having to restart the entire application. The main features of the proposed solution are: low overhead in failure-free executions, avoiding the checkpoint dumping associated with rollback strategies; low overhead at migration time, by means of a light and asynchronous protocol to achieve a consistent global state; transparency for the user, thanks to the use of a compiler tool and a runtime library; and portability, as it is not locked into a particular architecture, operating system or MPI implementation.
Ministerio de Ciencia e Innovación; TIN2010-16735
Galicia. Consellería de Economía e Industria; 10PXIB105180P
Parallel Implementation of the PHOENIX Generalized Stellar Atmosphere Program. II: Wavelength Parallelization
We describe an important addition to the parallel implementation of our
generalized NLTE stellar atmosphere and radiative transfer computer program
PHOENIX. In a previous paper in this series we described data and task parallel
algorithms we have developed for radiative transfer, spectral line opacity, and
NLTE opacity and rate calculations. These algorithms divide the work spatially
or by spectral lines, that is, they distribute the radial zones, individual
spectral lines, or characteristic rays among different processors and employ,
in addition, task parallelism for logically independent functions (such as
atomic and molecular line opacities). For finite, monotonic velocity fields,
the radiative transfer equation is an initial value problem in wavelength, and
hence each wavelength point depends upon the previous one. However, for
sophisticated NLTE models of both static and moving atmospheres needed to
accurately describe, e.g., novae and supernovae, the number of wavelength
points is very large (200,000--300,000) and hence parallelization over
wavelength can lead both to considerable speedup in calculation time and the
ability to make use of the aggregate memory available on massively parallel
supercomputers. Here, we describe an implementation of a pipelined design for
the wavelength parallelization of PHOENIX, where the necessary data from the
processor working on a previous wavelength point is sent to the processor
working on the succeeding wavelength point as soon as it is known. Our
implementation uses a MIMD design based on a relatively small number of
standard MPI library calls and is fully portable between serial and parallel
computers.
Comment: AAS-TeX, 15 pages, full text with figures available at
ftp://calvin.physast.uga.edu/pub/preprints/Wavelength-Parallel.ps.gz ApJ, in
press
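
The pipelined design described above can be sketched in a few lines. This is not PHOENIX code and does not use MPI: threads and queues stand in for processors and messages, wavelength points are assigned round-robin to workers, and `run_pipeline`, `local_work`, and `combine` are hypothetical names. The sketch assumes each point has a part that is independent of its predecessor and a part that needs the predecessor's result.

```python
import queue
import threading


def run_pipeline(num_points, num_workers, local_work, combine, initial):
    """Process points 0..num_points-1, where point i needs the result of i-1."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    results = [None] * num_points

    def worker(w):
        # Worker w handles points w, w + num_workers, w + 2*num_workers, ...
        for i in range(w, num_points, num_workers):
            independent = local_work(i)  # can run before upstream data arrives
            prev = inboxes[w].get()      # block until point i-1 is known
            results[i] = combine(prev, independent)
            if i + 1 < num_points:
                # Forward to the worker of the next point as soon as known.
                inboxes[(i + 1) % num_workers].put(results[i])

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    inboxes[0].put(initial)  # boundary value for the first point
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy dependency: each point adds its own square to the previous result.
totals = run_pipeline(6, 3, lambda i: i * i, lambda prev, x: prev + x, 0)
```

The speedup in such a design comes from the independent part of each point overlapping with the wait for upstream data; the dependent part remains a sequential chain.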