3,879 research outputs found
Lightweight MPI Communicators with Applications to Perfectly Balanced Quicksort
MPI uses the concept of communicators to connect groups of processes. It
provides nonblocking collective operations on communicators to overlap
communication and computation. Flexible algorithms demand flexible
communicators. E.g., a process can work on different subproblems within
different process groups simultaneously, new process groups can be created, or
the members of a process group can change. Depending on the number of
communicators, the time for communicator creation can drastically increase the
running time of the algorithm. Furthermore, a new communicator synchronizes all
processes as communicator creation routines are blocking collective operations.
We present RBC, a communication library based on MPI, that creates
range-based communicators in constant time without communication. These RBC
communicators support (non)blocking point-to-point communication as well as
(non)blocking collective operations. Our experiments show that the library
reduces the time to create a new communicator by a factor of more than 400
whereas the running time of collective operations remains about the same. We
propose Janus Quicksort, a distributed sorting algorithm that avoids any load
imbalances. We improved the performance of this algorithm by a factor of 15 for
moderate inputs by using RBC communicators. Finally, we discuss different
approaches to bring nonblocking (local) communicator creation of lightweight
(range-based) communicators into MPI
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface
Application development for distributed computing "Grids" can benefit from
tools that variously hide or enable application-level management of critical
aspects of the heterogeneous environment. As part of an investigation of these
issues, we have developed MPICH-G2, a Grid-enabled implementation of the
Message Passing Interface (MPI) that allows a user to run MPI programs across
multiple computers, at the same or different sites, using the same commands
that would be used on a parallel computer. This library extends the Argonne
MPICH implementation of MPI to use services provided by the Globus Toolkit for
authentication, authorization, resource allocation, executable staging, and
I/O, as well as for process creation, monitoring, and control. Various
performance-critical operations, including startup and collective operations,
are configured to exploit network topology information. The library also
exploits MPI constructs for performance management; for example, the MPI
communicator construct is used for application-level discovery of, and
adaptation to, both network topology and network quality-of-service mechanisms.
We describe the MPICH-G2 design and implementation, present performance
results, and review application experiences, including record-setting
distributed simulations.Comment: 20 pages, 8 figure
Mapping applications with collectives over sub-communicators on torus networks
pre-printThe placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, this task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing between messages and contention. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface to create a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations such as splits, tilts, or shifts, that can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed
system as if the memory were shared on a global level. Given such a global view
on memory, the user may program applications very much like shared memory
systems. This greatly simplifies the tasks of developing parallel applications,
because no explicit communication has to be specified in the program for data
exchange between different computing nodes. In this paper we present DART, a
runtime environment, which implements the PGAS paradigm on large-scale
high-performance computing clusters. A specific feature of our implementation
is the use of one-sided communication of the Message Passing Interface (MPI)
version 3 (i.e. MPI-3) as the underlying communication substrate. We evaluated
the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.Comment: 11 pages, International Conference on Partitioned Global Address
Space Programming Models (PGAS14
Experiences with OpenMP in tmLQCD
An overview is given of the lessons learned from the introduction of
multi-threading using OpenMP in tmLQCD. In particular, programming style,
performance measurements, cache misses, scaling, thread distribution for hybrid
codes, race conditions, the overlapping of communication and computation and
the measurement and reduction of certain overheads are discussed. Performance
measurements and sampling profiles are given for different implementations of
the hopping matrix computational kernel.Comment: presented at the 31st International Symposium on Lattice Field Theory
(Lattice 2013), 29 July - 3 August 2013, Mainz, German
- …