Kernel-assisted and Topology-aware MPI Collective Communication among Multicore or Many-core Clusters
Multicore and many-core clusters have become the most prominent form of High Performance Computing (HPC) systems. Hardware complexity and hierarchies exist not only at the inter-node layer, i.e., hierarchical networks, but also inside multicore compute nodes, e.g., Non-Uniform Memory Access (NUMA), network-style interconnects, and memory and shared-cache hierarchies.
The Message Passing Interface (MPI), the most widely adopted programming model in the HPC community, suffers from decreased performance and portability due to this increased, multi-level hardware complexity. We identified three critical issues specific to collective communication: first, the gap between logical collective topologies and the underlying hardware topologies; second, the lack of efficient shared-memory message-delivery approaches in current MPI communication; and last, on distributed-memory machines such as multicore clusters, the fact that a single approach cannot encompass the extreme variations not only in bandwidth and latency capabilities, but also in features such as the ability to perform multiple copies concurrently.
To bridge the gap between logical collective topologies and hardware topologies, we developed a distance-aware framework that integrates knowledge of hardware distances into collective algorithms in order to dynamically reshape the communication patterns to suit the hardware capabilities. Based on process distance information, we used graph partitioning techniques to organize the MPI processes into a multi-level hierarchy mapped onto the hardware characteristics. Meanwhile, we took advantage of the kernel-assisted one-sided single-copy approach (KNEM) as the default shared-memory delivery method. Via kernel-assisted memory copies, the collective algorithms offload copy tasks onto non-leader/non-root processes to evenly distribute copy workloads among the available cores. Finally, on distributed-memory machines, we developed a technique to compose multi-layered collective algorithms together to express a multi-level algorithm with tight interoperability between the levels. This tight collaboration results in more overlap between inter- and intra-node communication.
Experimental results have confirmed that, by leveraging several technologies together, such as kernel-assisted memory copy, the distance-aware framework, and collective algorithm composition, MPI collectives not only reach their potential maximum performance on a wide variety of platforms, but also deliver a level of performance immune to modifications of the underlying process-core binding.
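As an illustration of the multi-level composition idea, the following minimal sketch builds node-local and inter-node communicators with standard MPI calls and runs a broadcast in two stages. It assumes the global root is rank 0, and it does not include the distance-aware reshaping or the KNEM-based shared-memory copies described above.

```c
/* Sketch: a two-stage MPI broadcast composed over node-local and inter-node
 * communicators.  Illustrative only; assumes the global root is rank 0. */
#include <mpi.h>

void hierarchical_bcast(void *buf, int count, MPI_Datatype dtype, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm = MPI_COMM_NULL;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Group processes that share a node (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node joins the inter-node communicator. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Stage 1: leaders propagate the data across nodes (root = global rank 0,
     * which is rank 0 of leader_comm by construction). */
    if (node_rank == 0)
        MPI_Bcast(buf, count, dtype, 0, leader_comm);

    /* Stage 2: each leader fans the data out inside its node. */
    MPI_Bcast(buf, count, dtype, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL) MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}
```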
TZC: Efficient Inter-Process Communication for Robotics Middleware with Partial Serialization
Inter-process communication (IPC) is one of the core functions of modern robotics middleware. We propose an efficient IPC technique called TZC (Towards Zero-Copy). As a core component of TZC, we design a novel algorithm called partial serialization. Our formulation generates messages that can be divided into two parts: during message transmission, one part is transmitted through a socket and the other part uses shared memory. The part within shared memory is never copied or serialized during its lifetime. We have integrated TZC with ROS and ROS2 and find that TZC can be easily combined with current open-source platforms. By using TZC, the overhead of IPC remains constant as the message size grows. In particular, when the message size is 4MB (less than the size of a full HD image), TZC reduces the overhead of ROS IPC from tens of milliseconds to hundreds of microseconds and reduces the overhead of ROS2 IPC from hundreds of milliseconds to less than 1 millisecond. We also demonstrate the benefits of TZC by integrating it with a TurtleBot2 used in autonomous-driving scenarios, showing that with TZC the braking distance can be shortened by 16% compared with ROS.
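A minimal sketch of the partial-serialization idea follows, using POSIX shared memory: the large payload lives in a shared segment, and only a small descriptor would travel over the socket. The segment name, descriptor layout, and helper function below are illustrative, not TZC's actual implementation (which constructs the message in place, so even the single copy shown here disappears).

```c
/* Sketch of partial serialization: payload in shared memory, descriptor over
 * the socket.  "/tzc_demo" and msg_descriptor are illustrative names. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct {
    char   shm_name[32];   /* where the receiver maps the payload */
    size_t offset;         /* payload position inside the segment */
    size_t length;         /* payload size in bytes */
} msg_descriptor;

/* Sender side: place the payload in shared memory and fill the descriptor,
 * which is the only thing that needs to be serialized and sent. */
int publish_payload(const void *payload, size_t len, msg_descriptor *desc)
{
    int fd = shm_open("/tzc_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)len) < 0) return -1;

    void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { close(fd); return -1; }

    memcpy(base, payload, len);   /* in TZC the message is built here directly */
    munmap(base, len);
    close(fd);

    strncpy(desc->shm_name, "/tzc_demo", sizeof desc->shm_name);
    desc->offset = 0;
    desc->length = len;
    return 0;                     /* caller sends *desc over the socket */
}
```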
Programming Models' Support for Heterogeneous Architecture
Accelerator-enhanced computing platforms have drawn a lot of attention due to their massive peak computational capacity. Heterogeneous systems equipped with accelerators such as GPUs have become the most prominent components of High Performance Computing (HPC) systems. Even at the node level, the significant heterogeneity of CPU and GPU, i.e., hardware and memory-space differences, makes it challenging to fully exploit such complex architectures, and extending beyond the node scope only escalates these challenges.
Conventional programming models such as data-flow and message passing have been widely adopted in the HPC community. When moving towards heterogeneous systems, the lack of GPU integration causes such programming models to struggle in handling the heterogeneity of the different computing units, leading to sub-optimal performance and a drastic decrease in developer productivity. To bridge the gap between the underlying heterogeneous architectures and current programming paradigms, we propose to extend these paradigms with architecture-aware optimizations.
Two programming models are used to demonstrate the impact of heterogeneous architecture awareness. The PaRSEC task-based runtime, an adopter of the data-flow model, provides opportunities for overlapping communications with computations and minimizing data movement, as well as dynamically adapting the work granularity to the capability of the hardware.
To fulfill the demand for an efficient and portable Message Passing Interface (MPI) implementation that can communicate GPU data, a GPU-aware design is presented based on the Open MPI infrastructure, supporting efficient point-to-point and collective communication of GPU-resident data, for both contiguous and non-contiguous memory layouts, by leveraging the GPU network topology and hardware capabilities such as GPUDirect. This tight integration of GPU support into a widely used programming environment frees developers from manually moving data into and out of host memory before and after invoking MPI routines for communication, allowing them to focus instead on algorithmic optimizations.
Experimental results have confirmed that supported by such a tight and transparent integration, conventional programming models can once again take advantage of the state-of-the-art hardware and exhibit performance at the levels expected by the underlying hardware capabilities
KNEM: a Generic and Scalable Kernel-Assisted Intra-node MPI Communication Framework
The multiplication of cores in today's architectures raises the importance of intra-node communication in modern clusters and its impact on overall parallel application performance. Although several proposals have focused on this issue in the past, there is still a need for a portable and hardware-independent solution that addresses the requirements of both point-to-point and collective MPI operations inside shared-memory computing nodes. This paper presents the KNEM module for the Linux kernel, which provides MPI implementations with a flexible and scalable interface for performing kernel-assisted single-copy data transfers between local processes. It enables high-performance communication within most existing MPI implementations and brings significant application performance improvements thanks to more efficient point-to-point and collective operations.
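KNEM exposes its single-copy engine through an ioctl interface on /dev/knem. As a rough analogue with a documented libc-level API, the sketch below performs the same kind of kernel-assisted single copy between two local address spaces using Linux Cross Memory Attach (process_vm_readv); it is not KNEM's API, only an illustration of the single-copy principle that avoids an intermediate shared bounce buffer.

```c
/* Not the KNEM interface: an analogous kernel-assisted single copy between
 * two local processes via Linux Cross Memory Attach. */
#define _GNU_SOURCE
#include <sys/uio.h>
#include <sys/types.h>

/* Pull `len` bytes from `remote_addr` in process `pid` into `local_buf`. */
ssize_t single_copy_read(pid_t pid, void *local_buf,
                         void *remote_addr, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };

    /* One kernel-mediated copy: the data never transits a shared buffer. */
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}
```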
Optimization of MPI Collective Communication Operations
High-performance computing (HPC) systems keep growing in scale and heterogeneity to satisfy the increasing need for computation, and this brings new challenges to the design of Message Passing Interface (MPI) libraries, especially with regard to collective operations.
The implementations of state-of-the-art MPI collective operations rely heavily on synchronizations, and these implementations magnify noise across the participating processes, resulting in significant performance slowdowns. Therefore, I create a new collective communication framework in Open MPI, using an event-driven design to relax synchronizations and maintain the minimal data dependencies of MPI collective operations.
The recent growth in hardware heterogeneity results in increasingly complex hardware hierarchies and larger communication performance differences. Hence, in this dissertation, I present two approaches to perform hierarchical collective operations, both of which can exploit the different bandwidths of the hardware in heterogeneous systems and maximize concurrent communications.
Finally, to provide a fast and accurate autotuning mechanism for my framework, I design a new autotuning approach by combining two existing methods. This new approach significantly reduces the search space to save autotuning time while still providing accurate estimations.
I evaluate my work with microbenchmarks and applications at different scales. Microbenchmark results show that my work speeds up MPI_Bcast and MPI_Allreduce by up to 7.34X and 4.86X, respectively, on 4096 processes. In terms of applications, I achieve a 24.3% improvement for Horovod and a 143% improvement for ASP on 1536 processes as compared to the current Open MPI.
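One way to picture the hierarchical approach is the following two-stage allreduce over node-local and leader communicators (built as in the hierarchy sketch earlier in this listing). It conveys only the staging structure, not the event-driven framework or the autotuning described above.

```c
/* Sketch: hierarchical allreduce — reduce inside each node, allreduce among
 * node leaders over the network, then broadcast back inside each node.
 * leader_comm is MPI_COMM_NULL on non-leader ranks and is never used there. */
#include <mpi.h>

void hierarchical_allreduce(const void *sendbuf, void *recvbuf, int count,
                            MPI_Datatype dtype, MPI_Op op,
                            MPI_Comm node_comm, MPI_Comm leader_comm)
{
    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Stage 1: combine contributions inside the shared-memory node. */
    MPI_Reduce(sendbuf, recvbuf, count, dtype, op, 0, node_comm);

    /* Stage 2: node leaders exchange partial results across the network. */
    if (node_rank == 0)
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, dtype, op, leader_comm);

    /* Stage 3: fan the final result back out within each node. */
    MPI_Bcast(recvbuf, count, dtype, 0, node_comm);
}
```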
Improving the Performance of the MPI_Allreduce Collective Operation through Rank Renaming
Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014), Porto (Portugal), August 27-28, 2014.
Collective operations, a key issue in the global efficiency of HPC applications, are optimized in current MPI libraries by choosing at runtime among a set of algorithms, based on platform-dependent, previously established parameters such as the message size or the number of processes. However, with progressively more cores per node, the cost of a collective algorithm must be attributed mainly to the process-to-processor mapping, because of its decisive influence on the network traffic. The hierarchical design of collective algorithms pursues minimizing data movement through the slowest communication channels of the multicore cluster. Nevertheless, the hierarchical implementation of some collectives becomes inefficient, and even impracticable, due to the definition of the operation itself. This paper proposes a new approach that departs from a frequently found regular mapping, either sequential or round-robin. While keeping the mapping, the rank assignment of the processes is temporarily changed prior to the execution of the collective algorithm. The new assignment makes the communication pattern adapt to the hierarchy of communication channels. We explore this technique for the Ring algorithm when used in the well-known MPI_Allreduce collective, and discuss the obtained performance results. Extensions to other algorithms and collective operations are proposed.
The work presented in this paper has been partially supported by the EU under the COST programme Action IC1305, 'Network for Sustainable Ultrascale Computing (NESUS)', and by the computing facilities of the Extremadura Research Centre for Advanced Technologies (CETA-CIEMAT), funded by the European Regional Development Fund (ERDF). CETA-CIEMAT belongs to CIEMAT and the Government of Spain.
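The rank-renaming idea can be sketched as building a temporary communicator whose rank order follows the node hierarchy, so that most neighbors in a subsequent Ring-based MPI_Allreduce sit on the same node. The key computation below is illustrative (and assumes fewer than 4096 processes per node); the paper derives the renaming from the actual process-to-processor mapping.

```c
/* Sketch: reorder ranks so that processes on the same node occupy
 * consecutive positions, then run the Ring collective on `reordered`. */
#include <mpi.h>

MPI_Comm ring_friendly_comm(MPI_Comm comm)
{
    MPI_Comm node_comm, reordered;
    int world_rank, node_rank, node_id;

    MPI_Comm_rank(comm, &world_rank);
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Identify each node by the lowest world rank it contains. */
    node_id = world_rank;
    MPI_Allreduce(MPI_IN_PLACE, &node_id, 1, MPI_INT, MPI_MIN, node_comm);

    /* New rank order: sort by (node, local rank), making ranks on one node
     * contiguous so most ring neighbours communicate via shared memory. */
    MPI_Comm_split(comm, 0, node_id * 4096 + node_rank, &reordered);

    MPI_Comm_free(&node_comm);
    return reordered;
}
```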
Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism
Irregular communication often limits both the performance and scalability of parallel applications. Typically, applications individually implement irregular messages using point-to-point communications, and any optimizations are added directly into the application. As a result, these optimizations lack portability. There is no easy way to optimize point-to-point messages within MPI, as the interface for single messages provides no information on the collection of all communication to be performed. However, the persistent neighbor collective API, released in the MPI 4 standard, provides an interface for portable optimizations of irregular communication within MPI libraries. This paper presents methods for optimizing irregular communication within neighborhood collectives, analyzes the impact of replacing point-to-point communication in existing codebases such as Hypre BoomerAMG with neighborhood collectives, and finally shows a speedup of up to 1.32x on sparse matrix-vector multiplication within a BoomerAMG solve through the use of our optimized neighbor collectives. The authors analyze multiple implementations of neighborhood collectives, including a standard implementation, which simply wraps standard point-to-point communication, as well as multiple implementations of locality-aware aggregation. All optimizations are available in an open-source codebase, MPI Advance, which sits on top of MPI, allowing optimizations to be added into existing codebases regardless of the system MPI install.
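As a baseline for what such optimizations build on, the sketch below expresses an irregular exchange with the MPI-3 distributed-graph topology and a (non-persistent) neighborhood collective; once the sparsity pattern is captured in the topology, an MPI library or a layer such as MPI Advance is free to aggregate or reorder the messages. The neighbor lists, counts, and displacements are placeholders supplied by the application.

```c
/* Sketch: replacing hand-rolled irregular point-to-point exchange with a
 * distributed-graph topology and a single neighborhood collective call. */
#include <mpi.h>

void irregular_exchange(MPI_Comm comm,
                        int n_sources, const int *sources,   /* who sends to me */
                        int n_dests,   const int *dests,     /* whom I send to  */
                        const double *sendbuf, const int *sendcounts, const int *sdispls,
                        double *recvbuf,       const int *recvcounts, const int *rdispls)
{
    MPI_Comm graph_comm;

    /* Describe the communication pattern once, as a distributed graph. */
    MPI_Dist_graph_create_adjacent(comm,
                                   n_sources, sources, MPI_UNWEIGHTED,
                                   n_dests,   dests,   MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0 /* no reordering */,
                                   &graph_comm);

    /* One call replaces the whole set of irregular sends and receives. */
    MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                           recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                           graph_comm);

    MPI_Comm_free(&graph_comm);
}
```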
HDArray: Parallel Array Interface for Distributed Heterogeneous Devices
Heterogeneous clusters with nodes containing one or more accelerators, such as GPUs, have become common. While MPI provides a mechanism for, and management of, inter-address-space communication, and OpenCL provides a way to manage computation and communication within a process with access to heterogeneous computational resources, programmers are forced to write hybrid programs that manage the interaction of both of these systems. This paper describes an array programming interface that provides users with automatic or manual distribution of data and work. Using the distribution, together with information about what data is used and defined by kernels, communication among processes and among devices within a process is performed automatically. The interface provides a unified programming model to the user, thus simplifying program development.