6,860 research outputs found

    A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids

    The efficient implementation of collective communication operations has received much attention. Initial efforts produced "optimal" trees based on network communication models that assumed equal point-to-point latencies between any two processes. This assumption is violated in most practical settings, however, particularly in heterogeneous systems such as clusters of SMPs and wide-area "computational Grids," with the result that collective operations perform suboptimally. In response, more recent work has focused on creating topology-aware trees for collective operations that minimize communication across slower channels (e.g., a wide-area network). While these efforts deliver significant communication benefits, they all limit their view of the network to only two layers. We present a strategy based upon a multilayer view of the network. By creating multilevel topology-aware trees we take advantage of communication cost differences at every level in the network. We used this strategy to implement topology-aware versions of several MPI collective operations in MPICH-G2, the Globus Toolkit™-enabled version of the popular MPICH implementation of the MPI standard. Using information about topology provided by MPICH-G2, we construct these multilevel topology-aware trees automatically during execution. We present results demonstrating the advantages of our multilevel approach by comparing it to the default (topology-unaware) implementation provided by MPICH and a topology-aware two-layer implementation. (16 pages, 8 figures)
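
    As a rough illustration of the multilevel idea (not the MPICH-G2 implementation itself), the sketch below descends the hierarchy one level at a time: at each level, one leader per group receives the data over that level's slower links, and the same procedure then repeats inside each group over the faster levels below. The colors array, the function name, and the assumption that rank 0 of the input communicator holds the data are illustrative; MPICH-G2 derives the equivalent information from its own topology mechanism.

        /* Sketch of a multilevel topology-aware broadcast.  colors[l] is
         * assumed to identify the group this process belongs to at
         * hierarchy level l (level 0 = wide-area site, deeper levels =
         * successively faster networks); how the colors are obtained
         * (e.g. from MPICH-G2's topology information) is not shown. */
        #include <mpi.h>

        static void multilevel_bcast(void *buf, int count, MPI_Datatype type,
                                     MPI_Comm comm, const int *colors, int nlevels)
        {
            int rank;
            MPI_Comm_rank(comm, &rank);

            if (nlevels == 0) {                  /* fastest level: plain bcast */
                MPI_Bcast(buf, count, type, 0, comm);
                return;
            }

            /* Group processes that share this level's color; keying by rank
             * keeps rank 0 of comm as rank 0 of its group. */
            MPI_Comm group;
            MPI_Comm_split(comm, colors[0], rank, &group);
            int grank;
            MPI_Comm_rank(group, &grank);

            /* One leader per group (its rank-0 process) receives the data
             * over this level's slow links. */
            MPI_Comm leaders;
            MPI_Comm_split(comm, grank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
            if (leaders != MPI_COMM_NULL) {
                MPI_Bcast(buf, count, type, 0, leaders);
                MPI_Comm_free(&leaders);
            }

            /* Recurse inside each group over the faster levels below. */
            multilevel_bcast(buf, count, type, group, colors + 1, nlevels - 1);
            MPI_Comm_free(&group);
        }

    A production version would build the per-level communicators once and reuse them across calls rather than splitting on every broadcast.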

    Optimizing Memory Efficiency for Convolution Kernels on Kepler GPUs

    Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, and image processing. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating convolution operations. However, maximally exploiting the available memory bandwidth of GPUs for convolution is a challenging task. This paper introduces a general model to address the mismatch between the memory bank width of GPUs and the computation data width of threads. Based on this model, we develop two convolution kernels, one for the general case and the other for a special case with one input channel. By carefully optimizing memory access patterns and computation patterns, we design a communication-optimized kernel for the special case and a communication-reduced kernel for the general case. Experimental data from implementations on Kepler GPUs show that our kernels achieve 5.16x and 35.5% average performance improvement over the latest cuDNN library for the special case and the general case, respectively.
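
    The kernels themselves are specific to the paper, but the width mismatch it models is easy to picture: on Kepler, shared-memory banks can be configured to 8 bytes, so a thread that issues 8-byte (float2) accesses maps one access to one full bank instead of leaving half of each bank's width idle. The snippet below is only a generic illustration of that width-matching idea, not either of the paper's convolution kernels; the kernel name and sizes are made up.

        // Generic CUDA illustration of matching per-thread access width to
        // Kepler's 8-byte shared-memory bank mode (not the paper's kernels).
        #include <cuda_runtime.h>

        __global__ void scale_float2(const float2 *in, float2 *out, float a)
        {
            // Staging through shared memory exists only to show the bank
            // access pattern: each thread reads and writes one 8-byte element,
            // so with 8-byte banks consecutive threads touch distinct banks.
            __shared__ float2 tile[256];                   // matches the block size
            int i = blockIdx.x * blockDim.x + threadIdx.x; // grid sized so i is valid

            tile[threadIdx.x] = in[i];        // one 64-bit global load per thread
            __syncthreads();
            float2 v = tile[threadIdx.x];     // one 64-bit shared load, one bank per thread
            out[i] = make_float2(a * v.x, a * v.y);
        }

        int main(void)
        {
            const int n = 1 << 20;            // multiple of the block size
            const int block = 256;

            // Kepler-era switch to 8-byte shared-memory banks, so the 64-bit
            // accesses above each occupy a full bank.
            cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

            float2 *in, *out;
            cudaMalloc(&in, n * sizeof(float2));
            cudaMalloc(&out, n * sizeof(float2));
            scale_float2<<<n / block, block>>>(in, out, 2.0f);
            cudaDeviceSynchronize();
            cudaFree(in);
            cudaFree(out);
            return 0;
        }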

    Making the case for reforming the I/O software stack of extreme-scale systems

    This work was supported in part by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under Contract No. DE-AC02-05CH11231. This research has been partially funded by the Spanish Ministry of Science and Innovation under grant TIN2010-16497 “Input/Output techniques for distributed and high-performance computing environments”. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 328582.

    MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

    Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues, we have developed MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI) that allows a user to run MPI programs across multiple computers, at the same or different sites, using the same commands that would be used on a parallel computer. This library extends the Argonne MPICH implementation of MPI to use services provided by the Globus Toolkit for authentication, authorization, resource allocation, executable staging, and I/O, as well as for process creation, monitoring, and control. Various performance-critical operations, including startup and collective operations, are configured to exploit network topology information. The library also exploits MPI constructs for performance management; for example, the MPI communicator construct is used for application-level discovery of, and adaptation to, both network topology and network quality-of-service mechanisms. We describe the MPICH-G2 design and implementation, present performance results, and review application experiences, including record-setting distributed simulations. (20 pages, 8 figures)
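
    For context, the topology information mentioned above is exposed to applications through communicator attributes. The keyval names and attribute layout in the sketch below (MPICHX_TOPOLOGY_DEPTHS as an array of per-process depths, MPICHX_TOPOLOGY_COLORS as per-process arrays of level colors) are as recalled from the paper's description and should be verified against an actual MPICH-G2 installation before use; the rest is plain MPI.

        /* Hedged sketch: discover the process topology via MPICH-G2's
         * communicator attributes and group processes by wide-area site.
         * Keyval names and attribute types are assumptions to be checked
         * against the installed mpi.h. */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, have_depths = 0, have_colors = 0;
            int *depths;    /* depths[p]   : number of topology levels for rank p */
            int **colors;   /* colors[p][l]: group id of rank p at level l        */

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_DEPTHS, &depths, &have_depths);
            MPI_Attr_get(MPI_COMM_WORLD, MPICHX_TOPOLOGY_COLORS, &colors, &have_colors);

            if (have_depths && have_colors) {
                /* Level 0 is the wide-area level: ranks sharing its color sit
                 * at the same site, so confine heavy traffic to a site-local
                 * communicator. */
                MPI_Comm site_comm;
                MPI_Comm_split(MPI_COMM_WORLD, colors[rank][0], rank, &site_comm);
                int site_rank;
                MPI_Comm_rank(site_comm, &site_rank);
                printf("rank %d: depth %d, site %d, site-local rank %d\n",
                       rank, depths[rank], colors[rank][0], site_rank);
                MPI_Comm_free(&site_comm);
            }

            MPI_Finalize();
            return 0;
        }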

    Analyzing the performance of hierarchical collective algorithms on ARM-based multicore clusters

    MPI is the de facto standard communication library for parallel applications on distributed-memory architectures. The performance of collective operations is critical in HPC applications, as they can become the bottleneck of their executions. The advent of larger node sizes on multicore clusters has motivated the exploration of hierarchical collective algorithms that are aware of the process placement in the cluster and of the memory hierarchy. This work analyses and compares several hierarchical collective algorithms from the literature that are not part of the current MPI standard. We implement the algorithms on top of Open MPI, using the shared-memory facility provided by MPI-3 at the intra-node level, and evaluate them on ARM-based multicore clusters. Our results highlight aspects of the algorithms that affect their performance and applicability. Finally, we propose a model that helps us analyze the scalability of the algorithms. This work has been supported by the Spanish Ministry of Education (PID2019-107255GB-C22) and the Generalitat de Catalunya (2017-SGR-1414).
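
    A minimal sketch of the two-level structure these hierarchical algorithms share is shown below: processes on the same node are grouped with MPI-3's MPI_Comm_split_type, partial results are combined inside each node, one leader per node combines across nodes, and the result is broadcast back. It is a generic allreduce skeleton using plain message passing, not any of the specific algorithms evaluated in the paper, which additionally use the MPI-3 shared-memory facility for the intra-node step.

        /* Generic two-level allreduce: intra-node reduce, inter-node
         * allreduce among one leader per node, intra-node broadcast. */
        #include <mpi.h>

        void hierarchical_allreduce_sum(const double *sendbuf, double *recvbuf,
                                        int count, MPI_Comm comm)
        {
            int rank;
            MPI_Comm_rank(comm, &rank);

            /* MPI-3: group processes that can share memory, i.e. one
             * communicator per node. */
            MPI_Comm node;
            MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                                MPI_INFO_NULL, &node);
            int node_rank;
            MPI_Comm_rank(node, &node_rank);

            /* Step 1: reduce onto each node's leader (its rank 0). */
            MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node);

            /* Step 2: leaders combine partial sums across nodes. */
            MPI_Comm leaders;
            MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
            if (leaders != MPI_COMM_NULL) {
                MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                              leaders);
                MPI_Comm_free(&leaders);
            }

            /* Step 3: leaders broadcast the result within their node. */
            MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node);
            MPI_Comm_free(&node);
        }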

    Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism

    Irregular communication often limits both the performance and scalability of parallel applications. Typically, applications implement irregular messages individually using point-to-point communication, and any optimizations are added directly into the application. As a result, these optimizations lack portability. There is no easy way to optimize point-to-point messages within MPI, as the interface for single messages provides no information on the collection of all communication to be performed. However, the persistent neighbor collective API, released in the MPI 4 standard, provides an interface for portable optimizations of irregular communication within MPI libraries. This paper presents methods for optimizing irregular communication within neighborhood collectives, analyzes the impact of replacing point-to-point communication in existing codebases such as Hypre BoomerAMG with neighborhood collectives, and finally shows an up to 1.32x speedup on sparse matrix-vector multiplication within a BoomerAMG solve through the use of our optimized neighbor collectives. The authors analyze multiple implementations of neighborhood collectives, including a standard implementation that simply wraps point-to-point communication, as well as multiple implementations of locality-aware aggregation. All optimizations are available in an open-source codebase, MPI Advance, which sits on top of MPI, allowing optimizations to be added to existing codebases regardless of the system's MPI installation.
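
    As a rough sketch of the interface involved (not the MPI Advance optimizations themselves), the fragment below declares a rank's fixed communication pattern once with MPI_Dist_graph_create_adjacent and then performs the whole irregular exchange with a single MPI-3 MPI_Neighbor_alltoallv call; the MPI-4 persistent variant and the locality-aware implementations studied in the paper follow the same shape. The buffer and count parameters are placeholders for whatever the application exchanges, e.g. halo values for a sparse matrix-vector product.

        /* Sketch: express a fixed irregular exchange as a neighborhood
         * collective instead of hand-written point-to-point messages.
         * sources/dests (with counts and displacements) describe this
         * rank's communication pattern. */
        #include <mpi.h>

        MPI_Comm build_neighbor_comm(MPI_Comm comm,
                                     int indegree, const int *sources,
                                     int outdegree, const int *dests)
        {
            MPI_Comm graph;
            /* Each rank declares only its own neighbors; reorder = 0 keeps
             * the original rank numbering. */
            MPI_Dist_graph_create_adjacent(comm,
                                           indegree, sources, MPI_UNWEIGHTED,
                                           outdegree, dests, MPI_UNWEIGHTED,
                                           MPI_INFO_NULL, 0, &graph);
            return graph;
        }

        void exchange(MPI_Comm graph,
                      const double *sendbuf, const int *sendcounts, const int *sdispls,
                      double *recvbuf, const int *recvcounts, const int *rdispls)
        {
            /* One call replaces the usual loop of MPI_Isend/MPI_Irecv pairs,
             * handing the MPI library the whole pattern to optimize. */
            MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                                   recvbuf, recvcounts, rdispls, MPI_DOUBLE, graph);
        }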