MPI Collectives for Multi-core Clusters: Optimized Performance of the Hybrid MPI+MPI Parallel Codes
The advent of multi-/many-core processors in clusters advocates hybrid
parallel programming, which combines Message Passing Interface (MPI) for
inter-node parallelism with a shared memory model for on-node parallelism.
Compared to the traditional hybrid approach of MPI plus OpenMP, a new, but
promising hybrid approach of MPI plus MPI-3 shared-memory extensions (MPI+MPI)
is gaining traction. We describe an algorithmic approach for collective
operations (with allgather and broadcast as concrete examples) in the context
of hybrid MPI+MPI, so as to minimize memory consumption and memory copies. With
this approach, only a single copy of the data is kept per node and shared by
the on-node processes. This removes the redundant on-node copies of replicated
data that are required between MPI processes when the collectives are invoked
in the context of pure MPI. We compare our collectives for hybrid MPI+MPI with
the traditional ones for pure MPI, and discuss the synchronization required to
guarantee data integrity. The performance
of our approach has been validated on a Cray XC40 system (Cray MPI) and a NEC
cluster (Open MPI), showing that it achieves comparable or better performance
for allgather operations. We have further validated our approach with a
standard computational kernel, namely distributed matrix multiplication, and a
Bayesian Probabilistic Matrix Factorization code.
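To make the single-copy idea concrete, here is a minimal sketch of an allgather in the hybrid MPI+MPI style. It assumes a sequential rank-to-node mapping; the function name shared_allgather and the elided leader exchange are our illustrative assumptions, not the paper's code:

```c
/* Minimal sketch of an allgather in the hybrid MPI+MPI style: each node
 * keeps one shared copy of the result in an MPI-3 shared-memory window,
 * and only node leaders take part in the inter-node exchange.
 * Assumes a sequential rank-to-node mapping; error checking omitted. */
#include <mpi.h>
#include <string.h>

void shared_allgather(const double *sendbuf, int count, MPI_Comm comm)
{
    int world_rank, world_size, node_rank;
    MPI_Comm node_comm, leader_comm;
    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_size(comm, &world_size);

    /* Group the ranks that can share memory (i.e., one group per node). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Node leaders (node rank 0) form the inter-node communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* One shared copy of the full result per node, owned by the leader. */
    MPI_Win win;
    double *shared;
    MPI_Aint bytes = (node_rank == 0)
                   ? (MPI_Aint)world_size * count * sizeof(double) : 0;
    MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                            node_comm, &shared, &win);
    if (node_rank != 0) {               /* locate the leader's copy */
        MPI_Aint sz; int du;
        MPI_Win_shared_query(win, 0, &sz, &du, &shared);
    }

    /* Every on-node rank deposits its block directly; a barrier (plus
     * MPI_Win_sync under the separate memory model) guards integrity. */
    memcpy(shared + (size_t)world_rank * count, sendbuf,
           count * sizeof(double));
    MPI_Barrier(node_comm);

    if (node_rank == 0) {
        /* Leaders would now exchange whole node blocks over leader_comm,
         * e.g. with MPI_Allgatherv into `shared`; elided in this sketch. */
    }
    MPI_Barrier(node_comm);  /* all on-node ranks may now read `shared` */

    MPI_Win_free(&win);
    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```

The point of the sketch is the memory layout: non-leader ranks never allocate a result buffer of their own, they write into and read from the leader's window, which is exactly the replicated copy a pure-MPI allgather would otherwise create per process.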
Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters
Multicore or many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchy exist not only in the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies.
Message Passing Interface (MPI), the programming model most widely adopted in the HPC community, suffers from decreased performance and portability due to this multi-level hardware complexity. We identified three critical issues specific to collective communication: first, the gap between logical collective topologies and the underlying hardware topologies; second, the lack of efficient shared-memory message-delivery approaches in current MPI implementations; and last, on distributed-memory machines such as multicore clusters, no single approach can encompass the wide variations in bandwidth and latency, or in features such as the ability to perform multiple concurrent copies simultaneously.
To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distances into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes in a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copy, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among the available cores. Finally, on distributed-memory machines, we developed a technique to compose per-level collective algorithms into a single multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication.
Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, not only do MPI collectives reach their potential maximum performance on a wide variety of platforms, but they also deliver a level of performance immune to changes in the underlying process-to-core binding.
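The composition idea above (expressing one collective as tightly coupled per-level pieces) can be sketched with standard MPI calls. This is a minimal two-level illustration that uses MPI_Comm_split_type as a stand-in for the paper's distance-aware graph partitioning, and it omits the KNEM kernel-assisted copy path entirely:

```c
/* Minimal two-level composition in standard MPI: reduce inside each node,
 * allreduce across the node leaders, then broadcast back inside the node.
 * The paper derives its levels from measured hardware distances via graph
 * partitioning; the node-level split here is only an approximation. */
#include <mpi.h>

void hierarchical_allreduce(double *buf, int count, MPI_Comm comm)
{
    MPI_Comm node, leaders;
    int nrank;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node);
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_split(comm, nrank == 0 ? 0 : MPI_UNDEFINED, 0, &leaders);

    /* Level 1: combine on-node contributions at the node leader. */
    if (nrank == 0)
        MPI_Reduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, 0, node);
    else
        MPI_Reduce(buf, NULL, count, MPI_DOUBLE, MPI_SUM, 0, node);

    /* Level 2: leaders combine partial results across nodes. */
    if (leaders != MPI_COMM_NULL)
        MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, leaders);

    /* Back to level 1: fan the result out inside each node. */
    MPI_Bcast(buf, count, MPI_DOUBLE, 0, node);

    if (leaders != MPI_COMM_NULL) MPI_Comm_free(&leaders);
    MPI_Comm_free(&node);
}
```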
Improving the Performance of the MPI_Allreduce Collective Operation through Rank Renaming
Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014), Porto (Portugal), August 27-28, 2014.
Collective operations, a key issue in the global efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime from a set of algorithms, based on platform-dependent parameters established beforehand, such as the message size or the number of processes. However, with progressively more cores per node, the cost of a collective algorithm must be mainly imputed to the process-to-processor mapping, because of its decisive influence on the network traffic. The hierarchical design of collective algorithms pursues the minimization of data movement through the slowest communication channels of the multi-core cluster. Nevertheless, the hierarchical implementation of some collectives becomes inefficient, or even impracticable, due to the definition of the operation itself. This paper proposes a new approach that departs from a frequently found regular mapping, either sequential or round-robin. While keeping the mapping, the rank assignment of the processes is temporarily changed prior to the execution of the collective algorithm. The new assignment makes the communication pattern adapt to the hierarchy of communication channels. We explore this technique for the Ring algorithm when used in the well-known MPI_Allreduce collective, and discuss the performance results obtained. Extensions to other algorithms and collective operations are proposed.
The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)', and by the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
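The rank-renaming step can be illustrated with a hypothetical helper that builds a reordered communicator before invoking the collective. The key computation below assumes a round-robin mapping onto `nodes` nodes with the process count divisible by `nodes`; it is a sketch of the idea, not the paper's implementation:

```c
/* Illustrative rank renaming: keep the process-to-core mapping, but give
 * the collective a communicator whose rank order matches the hardware
 * hierarchy, so the Ring algorithm's neighbour exchanges stay on-node as
 * long as possible.  Assumes round-robin placement over `nodes` nodes
 * and a process count divisible by `nodes`. */
#include <mpi.h>

void renamed_allreduce(double *buf, int count, int nodes, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Under round-robin placement, rank r runs on node r % nodes.
     * Ordering by (node, slot) makes consecutive new ranks (the ring
     * neighbours) mostly share a node. */
    int node = rank % nodes;
    int slot = rank / nodes;
    int key  = node * (size / nodes) + slot;

    MPI_Comm renamed;
    MPI_Comm_split(comm, 0, key, &renamed);   /* same members, new ranks */

    MPI_Allreduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE, MPI_SUM, renamed);

    MPI_Comm_free(&renamed);
}
```

Because every process passes the same color, MPI_Comm_split here only reorders ranks; the physical placement of processes is untouched, which is exactly the separation between mapping and rank assignment that the abstract describes.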
Distributed-Memory Breadth-First Search on Massive Graphs
This chapter studies the problem of traversing large graphs using the
breadth-first search order on distributed-memory supercomputers. We consider
both the traditional level-synchronous top-down algorithm and the
recently discovered direction optimizing algorithm. We analyze the performance
and scalability trade-offs in using different local data structures such as CSR
and DCSC, enabling in-node multithreading, and 1D and 2D graph decompositions.
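For readers unfamiliar with the communication step of the level-synchronous algorithm under a 1D decomposition, the following hedged sketch shows the per-level exchange of discovered vertices. The function name and buffer layout are illustrative; frontier expansion, packing by owner rank, and parent marking happen around this call:

```c
/* Hedged sketch of the communication in one level of level-synchronous
 * BFS under a 1D vertex decomposition: each rank has packed the newly
 * discovered global vertex ids into `sendbuf`, grouped by owner rank
 * with per-destination counts in `sendcnt`.  The exchange itself is a
 * counts handshake followed by MPI_Alltoallv. */
#include <mpi.h>
#include <stdlib.h>

void exchange_frontier(const int *sendbuf, const int *sendcnt,
                       int **recvbuf, int *recvtotal, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    /* First learn how much each peer will send us. */
    int *recvcnt = malloc(size * sizeof(int));
    MPI_Alltoall(sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, comm);

    /* Prefix sums give the send/receive displacements. */
    int *sdisp = malloc(size * sizeof(int));
    int *rdisp = malloc(size * sizeof(int));
    sdisp[0] = rdisp[0] = 0;
    for (int d = 1; d < size; d++) {
        sdisp[d] = sdisp[d - 1] + sendcnt[d - 1];
        rdisp[d] = rdisp[d - 1] + recvcnt[d - 1];
    }
    *recvtotal = rdisp[size - 1] + recvcnt[size - 1];
    *recvbuf   = malloc(*recvtotal * sizeof(int));

    /* Route every discovered vertex to its owner; the owners then mark
     * first-time visits and build the next frontier from *recvbuf. */
    MPI_Alltoallv(sendbuf, sendcnt, sdisp, MPI_INT,
                  *recvbuf, recvcnt, rdisp, MPI_INT, comm);

    free(recvcnt); free(sdisp); free(rdisp);
}
```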
Collective-Optimized FFTs
This paper measures the impact of various alltoallv methods. Results are
analyzed within Beatnik, a Z-model solver that is bottlenecked by HeFFTe and
representative of applications that rely on FFTs.
Hierarchical Implementation of Aggregate Functions
Most systems in HPC make use of hierarchical designs that allow multiple levels of parallelism to be exploited by programmers. The use of multiple multi-core/multi-processor computers to form a computer cluster supports both fine-grain and large-grain parallel computation. Aggregate function communications provide an easy-to-use and efficient set of mechanisms for communicating and coordinating between processing elements, but the model originally targeted only fine-grain parallel hardware. This work shows that a hierarchical implementation of aggregate functions is a viable alternative to MPI (the standard Message Passing Interface library) for programming clusters that provide both fine-grain and large-grain execution. Performance of a prototype implementation is evaluated and compared to that of MPI.
Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism
Irregular communication often limits both the performance and scalability of
parallel applications. Typically, applications individually implement irregular
messages using point-to-point communications, and any optimizations are added
directly into the application. As a result, these optimizations lack
portability. There is no easy way to optimize point-to-point messages within
MPI, as the interface for single messages provides no information on the
collection of all communication to be performed. However, the persistent
neighbor collective API, released in the MPI 4 standard, provides an interface
for portable optimizations of irregular communication within MPI libraries.
This paper presents methods for optimizing irregular communication within
neighborhood collectives, analyzes the impact of replacing point-to-point
communication in existing codebases such as Hypre BoomerAMG with neighborhood
collectives, and finally shows an up to 1.32x speedup on sparse matrix-vector
multiplication within a BoomerAMG solve through the use of our optimized
neighbor collectives. The authors analyze several implementations of
neighborhood collectives, including a standard implementation that simply
wraps point-to-point communication, as well as multiple implementations of
locality-aware aggregation. All optimizations are available
in an open-source codebase, MPI Advance, which sits on top of MPI, allowing for
optimizations to be added into existing codebases regardless of the system MPI
install.
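As a rough illustration of the persistent neighborhood-collective pattern the paper builds on (the MPI 4 standard API itself, not the MPI Advance internals), the sketch below declares an irregular pattern once and prepares a restartable exchange. All buffer and degree parameters are assumptions for the example:

```c
/* Sketch of the MPI 4 persistent neighborhood-collective pattern: the
 * distributed graph topology hands the library the full irregular
 * communication pattern up front, and the persistent alltoallv is set up
 * once and restarted every iteration.  All parameters are illustrative. */
#include <mpi.h>

void setup_persistent_exchange(MPI_Comm comm,
        int indeg,  const int *srcs,   /* ranks we receive from */
        int outdeg, const int *dsts,   /* ranks we send to */
        const double *sendbuf, const int *scnt, const int *sdsp,
        double *recvbuf, const int *rcnt, const int *rdsp,
        MPI_Comm *nbr_comm, MPI_Request *req)
{
    /* Describe the pattern once; reorder = 0 keeps ranks unchanged. */
    MPI_Dist_graph_create_adjacent(comm,
                                   indeg, srcs, MPI_UNWEIGHTED,
                                   outdeg, dsts, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, nbr_comm);

    /* MPI 4 persistent form: initialize once, start many times. */
    MPI_Neighbor_alltoallv_init(sendbuf, scnt, sdsp, MPI_DOUBLE,
                                recvbuf, rcnt, rdsp, MPI_DOUBLE,
                                *nbr_comm, MPI_INFO_NULL, req);
}

/* Per iteration (e.g., per SpMV):
 *     pack sendbuf;
 *     MPI_Start(req);
 *     ... overlap local computation ...
 *     MPI_Wait(req, MPI_STATUS_IGNORE);                                 */
```

Unlike a point-to-point implementation, the graph communicator exposes the whole collection of messages to the MPI library up front, which is what gives a library such as MPI Advance room to apply locality-aware aggregation underneath the same call.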