POSH: Paris OpenSHMEM: A High-Performance OpenSHMEM Implementation for Shared Memory Systems
In this paper we present the design and implementation of POSH, an open-source implementation of the OpenSHMEM standard. We present a model for its communications and prove some properties of the memory model defined in the OpenSHMEM specification. We present performance measurements of the communication library provided by POSH and compare them with an existing one-sided communication library. POSH can be downloaded from
\url{http://www.lipn.fr/~coti/POSH}.
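To make the communication model concrete, here is a minimal sketch of the one-sided style that an implementation such as POSH must support; it uses only standard OpenSHMEM C API calls and is not taken from POSH itself.

```c
#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric variable: exists at the same address on every PE. */
    static int target = -1;

    /* One-sided put: write our rank into the next PE's copy of
       `target` without any action by that PE. */
    shmem_int_p(&target, me, (me + 1) % npes);

    /* Make remote updates visible before reading locally. */
    shmem_barrier_all();

    printf("PE %d received %d\n", me, target);
    shmem_finalize();
    return 0;
}
```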
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
A Partitioned Global Address Space (PGAS) approach treats a distributed system as if its memory were shared at a global level. Given such a global view of memory, the user may program applications much as they would for shared memory systems. This greatly simplifies the task of developing parallel applications, because no explicit communication has to be specified in the program for data exchange between different computing nodes. In this paper we present DART, a runtime environment which implements the PGAS paradigm on large-scale high-performance computing clusters. A specific feature of our implementation is the use of one-sided communication from version 3 of the Message Passing Interface (i.e., MPI-3) as the underlying communication substrate. We evaluated the performance of the implementation with several low-level kernels in order
to determine overheads and limitations in comparison to the underlying MPI-3.
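As a hedged illustration of this layering (not DART's actual code), the sketch below expresses a PGAS-style remote write directly on the MPI-3 one-sided interface: a window exposes each process's memory, and a passive-target epoch allows puts at any time, mirroring the global-address-space view.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one int per process as globally addressable memory. */
    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1;

    /* Passive-target epoch: any process may access any window
       without the target's participation. */
    MPI_Win_lock_all(0, win);
    int next = (rank + 1) % size;
    MPI_Put(&rank, 1, MPI_INT, next, 0, 1, MPI_INT, win);
    MPI_Win_flush(next, win);   /* complete the put at the target */
    MPI_Win_unlock_all(win);

    /* Barrier so the neighbor's put has landed before we read. */
    MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d holds %d\n", rank, *base);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```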
Development and performance comparison of MPI and Fortran Coarrays within an atmospheric research model
A mini-application of the Intermediate Complexity Atmospheric Research (ICAR) model offers an opportunity to compare the costs and performance of the Message Passing Interface (MPI) versus coarray Fortran, two methods of communication across processes. The application requires repeated communication of halo regions, which is performed with either MPI or coarrays. The MPI communication is done using non-blocking two-sided communication, while the coarray library is implemented using a one-sided MPI or OpenSHMEM communication backend. We examine the development cost in addition to performing strong and weak scaling analyses to understand the performance costs.
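The halo exchange mentioned above follows a standard pattern; the sketch below shows a minimal 1-D version using non-blocking two-sided MPI in C (the mini-app itself is Fortran and exchanges multi-dimensional halos, so this is only a simplified stand-in).

```c
#include <mpi.h>

#define NX 8   /* interior cells per rank (illustrative size) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local array with one halo cell on each side: u[0] and u[NX+1]. */
    double u[NX + 2];
    for (int i = 1; i <= NX; i++) u[i] = rank;

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Post all receives and sends, then wait: the non-blocking
       two-sided pattern used for halo regions. */
    MPI_Request req[4];
    MPI_Irecv(&u[0],      1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&u[NX + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(&u[1],      1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&u[NX],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}
```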
Toward Reliable and Efficient Message Passing Software for HPC Systems: Fault Tolerance and Vector Extension
As the scale of High-Performance Computing (HPC) systems continues to grow, researchers devote themselves to achieving the best performance when running long computing jobs on these systems. My research focuses on the reliability and efficiency of HPC software.
First, as systems become larger, the mean-time-to-failure (MTTF) of these HPC systems is negatively impacted and tends to decrease. Handling system failures becomes a prime challenge. My research presents a general design and implementation of an efficient runtime-level failure detection and propagation strategy targeting large-scale, dynamic systems that is able to detect both node and process failures. It uses multiple overlapping topologies to optimize detection and propagation, minimizing the incurred overhead and guaranteeing the scalability of the entire framework. Results from different machines and benchmarks, compared to related work, show that my design and implementation outperform non-HPC solutions significantly and are competitive with specialized HPC solutions that can manage only MPI applications.
Second, I explore instruction-level parallelism to achieve optimal performance. Modern processors support long vector extensions, which enable researchers to exploit the potential peak performance of target architectures. Intel introduced Advanced Vector Extensions (AVX-512 and AVX2) for the x86 Instruction Set Architecture (ISA); Arm introduced the Scalable Vector Extension (SVE) with a new set of A64 instructions. Both enable greater parallelism. My research utilizes long vector reduction instructions to improve the performance of MPI reduction operations. I also use the gather and scatter features to speed up the packing and unpacking operations in MPI. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures.
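As an illustration of the vector-reduction idea (a simplified stand-in, not the dissertation's code), the sketch below sums a float array with AVX2 intrinsics, the kind of kernel an MPI library can invoke inside its reduction operations.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

/* Sum a float array with AVX2: 8 lanes per iteration, then a
   horizontal reduction of the accumulator vector. Compile with -mavx2. */
static float vec_sum(const float *a, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));

    /* Horizontal add: reduce the 8 lanes to a scalar. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);

    for (; i < n; i++) total += a[i];  /* scalar tail */
    return total;
}

int main(void) {
    float a[20];
    for (int i = 0; i < 20; i++) a[i] = 1.0f;
    printf("%f\n", vec_sum(a, 20));  /* prints 20.000000 */
    return 0;
}
```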
A Multithreaded Communication Substrate for OpenSHMEM
OpenSHMEM scalability is strongly dependent on the capability of its communication layer to efficiently handle multiple threads. In this paper, we present an early evaluation of the thread safety specification in the Unified Common Communication Substrate (UCCS) employed in OpenSHMEM. Results demonstrate that thread safety can be provided at an acceptable cost and can improve efficiency for some operations, compared to serializing communication.
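For context, a threaded OpenSHMEM program requests a thread level at initialization. The sketch below uses shmem_init_thread and SHMEM_THREAD_MULTIPLE from the later OpenSHMEM 1.4 specification together with OpenMP threads; the UCCS-era interface evaluated in the paper differs, so this is only an assumed illustration of the programming pattern.

```c
#include <shmem.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* Request full thread safety from the OpenSHMEM library. */
    int provided;
    shmem_init_thread(SHMEM_THREAD_MULTIPLE, &provided);
    if (provided != SHMEM_THREAD_MULTIPLE) {
        shmem_finalize();
        return 1;  /* library cannot serve concurrent threads */
    }

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    static long counter = 0;   /* symmetric target variable */

    /* Four threads per PE issue one-sided atomics concurrently;
       with THREAD_MULTIPLE the library handles them safely. */
    #pragma omp parallel num_threads(4)
    {
        shmem_long_atomic_inc(&counter, (me + 1) % npes);
    }

    shmem_barrier_all();
    printf("PE %d: counter = %ld\n", me, counter);  /* 4 on each PE */
    shmem_finalize();
    return 0;
}
```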
Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi
OpenSHMEM is a PGAS library that aims to deliver high performance while retaining portability. Communication operations are a major obstacle to scalable parallel performance and are highly dependent on the target architecture. However, to date there has been no work on how to efficiently support OpenSHMEM running natively on Intel Xeon Phi, a highly parallel, power-efficient and widely used many-core architecture. Given the importance of communication in parallel architectures, this paper describes a novel methodology for optimizing remote-memory accesses for execution of OpenSHMEM programs on Intel Xeon Phi processors. In native mode, we can exploit the Xeon Phi shared memory and convert OpenSHMEM one-sided communication calls into local load/store statements using the shmem_ptr routine. This approach makes it possible for the compiler to perform essential optimizations for Xeon Phi such as vectorization. To the best of our knowledge, this is the first time the impact of shmem_ptr has been analyzed thoroughly on a many-core system. We show the benefits of this approach on the PGAS-Microbenchmarks we specifically developed for this research. Our results exhibit a decrease in latency for one-sided communication operations of up to 60% and an increase in bandwidth of up to 12x. Moreover, we study different reduction algorithms and exploit local load/store to optimize data transfers in these algorithms for Xeon Phi, which permits improvements of up to 22% compared to MVAPICH and up to 60% compared to Intel MPI. Beyond microbenchmarks, experimental results on the NAS IS and SP benchmarks show that performance gains of up to 20x are possible.
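A short sketch of the shmem_ptr approach: when the target PE's memory is directly addressable (as in native execution on a shared-memory many-core), the returned pointer turns one-sided calls into plain loads and stores the compiler can vectorize; otherwise the code falls back to the library call.

```c
#include <shmem.h>
#include <stdio.h>

int main(void) {
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    static double buf[4] = {0};   /* symmetric destination */
    int peer = (me + 1) % npes;

    /* shmem_ptr returns a local address for a symmetric object on a
       remote PE when both PEs share memory, or NULL otherwise. */
    double *remote = (double *) shmem_ptr(buf, peer);
    if (remote != NULL) {
        /* Plain stores the compiler can see and vectorize. */
        for (int i = 0; i < 4; i++)
            remote[i] = me + i;
    } else {
        /* Fall back to the one-sided library call. */
        double src[4] = {me + 0.0, me + 1.0, me + 2.0, me + 3.0};
        shmem_double_put(buf, src, 4, peer);
    }

    shmem_barrier_all();
    printf("PE %d: buf[0] = %g\n", me, buf[0]);
    shmem_finalize();
    return 0;
}
```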
Exploring Fully Offloaded GPU Stream-Aware Message Passing
Modern heterogeneous supercomputing systems are composed of CPUs, GPUs, and
high-speed network interconnects. Communication libraries supporting efficient
data transfers involving memory buffers from the GPU memory typically require
the CPU to orchestrate the data transfer operations. A new offload-friendly
communication strategy, stream-triggered (ST) communication, was explored to
allow offloading the synchronization and data movement operations from the CPU
to the GPU. An implementation based on Message Passing Interface (MPI) one-sided active target synchronization was used as an exemplar to illustrate the
proposed strategy. A latency-sensitive nearest neighbor microbenchmark was used
to explore the various performance aspects of the implementation. The offloaded
implementation shows significant on-node performance advantages over standard
MPI active RMA (36%) and point-to-point (61%) communication. The current
multi-node improvement is less (23% faster than standard active RMA but 11%
slower than point-to-point), but work is in progress to pursue further improvements.
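For reference, the sketch below shows the classic host-driven form of MPI one-sided active target synchronization (post/start/complete/wait) on which the exemplar builds; the stream-triggered offload itself relies on GPU stream and vendor mechanisms not shown here.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *base;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);
    *base = -1;

    /* Ring neighbors: we put to `right` and are accessed by `left`. */
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Group world, acc_grp, exp_grp;
    MPI_Comm_group(MPI_COMM_WORLD, &world);
    MPI_Group_incl(world, 1, &right, &acc_grp);
    MPI_Group_incl(world, 1, &left,  &exp_grp);

    /* Active target sync: both origin and target take part in the
       synchronization, but the data movement itself is one-sided. */
    MPI_Win_post(exp_grp, 0, win);    /* expose window to `left`   */
    MPI_Win_start(acc_grp, 0, win);   /* open access to `right`    */
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_complete(win);            /* puts done at the origin    */
    MPI_Win_wait(win);                /* exposure epoch finished    */

    printf("rank %d received %d\n", rank, *base);

    MPI_Group_free(&world);
    MPI_Group_free(&acc_grp);
    MPI_Group_free(&exp_grp);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```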