2,907 research outputs found

    Towards Automatic and Adaptive Optimizations of MPI Collective Operations

    Get PDF
    Message passing is one of the most commonly used paradigms of parallel programming. Message Passing Interface, MPI, is a standard used in scientific and high-performance computing. Collective operations are a subset of MPI standard that deals with processes synchronization, data exchange and computation among a group of processes. The collective operations are commonly used and can be application performance bottleneck. The performance of collective operations depends on many factors, some of which are the input parameters (e.g., communicator and message size); system characteristics (e.g., interconnect type); the application computation and communication pattern; and internal algorithm parameters (e.g., internal segment size). We refer to an algorithm and its internal parameters as a method. The goal of this dissertation is a performance improvement of MPI collective operations and applications that use them. In our framework, during a collective call, a system-specific decision function is invoked to select the most appropriate method for the particular collective instance. This dissertation focuses on automatic techniques for system-specific decision function generation. Our approach takes the following steps: first, we collect method performance information on the system of interest; second, we analyze this information using parallel communication models, graphical encoding methods, and decision trees; third, based on the previous step, we automatically generate the system-specific decision function to be used at run-time. In situation when a detailed performance measurement is not feasible, method performance models can be used to supplement the measured method performance information. We build and evaluate parallel communication models of 35 different collective algorithms. These models are built on top of the three commonly used point-to-point communication models, Hockney, LogGP, and PLogP.We use the method performance information on a system to build quadtrees and C4.5 decision trees of variable sizes and accuracies. The collective method selection functions are then generated automatically from these trees. Our experiments show that quadtrees of three or four levels are often enough to approximate experimentally optimal decision with a small mean performance penalty (less than 10%). The C4.5 decision trees are even more accurate (with mean performance penalty of less than 5%). The size and accuracy of C4.5 decision trees can be further improved with use of appropriate composite attributes (such as ā€œtotal message sizeā€, or ā€œeven communicator sizeā€.) Finally, we apply these techniques to tune the collective operations on the Grig cluster at the University of Tennessee and to improve an application performance on the Cray XT4 system at Oak Ridge National Laboratory. The tuned collective is able to achieve more than 40% mean performance improvement over the native broadcast implementation. Using the platform-specific reduce on Cray XT4 lead to 10% improvement in the overall application performance. Our results show that the methods we explored are both applicable and effective for the system-specific optimizations of collective operations and are a right step toward automatically tunable, adaptive, MPI collectives

    The Capacity of Smartphone Peer-To-Peer Networks

    Get PDF
    We study three capacity problems in the mobile telephone model, a network abstraction that models the peer-to-peer communication capabilities implemented in most commodity smartphone operating systems. The capacity of a network expresses how much sustained throughput can be maintained for a set of communication demands, and is therefore a fundamental bound on the usefulness of a network. Because of this importance, wireless network capacity has been active area of research for the last two decades. The three capacity problems that we study differ in the structure of the communication demands. The first problem is pairwise capacity, where the demands are (source, destination) pairs. Pairwise capacity is one of the most classical definitions, as it was analyzed in the seminal paper of Gupta and Kumar on wireless network capacity. The second problem we study is broadcast capacity, in which a single source must deliver packets to all other nodes in the network. Finally, we turn our attention to all-to-all capacity, in which all nodes must deliver packets to all other nodes. In all three of these problems we characterize the optimal achievable throughput for any given network, and design algorithms which asymptotically match this performance. We also study these problems in networks generated randomly by a process introduced by Gupta and Kumar, and fully characterize their achievable throughput. Interestingly, the techniques that we develop for all-to-all capacity also allow us to design a one-shot gossip algorithm that runs within a polylogarithmic factor of optimal in every graph. This largely resolves an open question from previous work on the one-shot gossip problem in this model

    Mapping applications with collectives over sub-communicators on torus networks

    Get PDF
    pre-printThe placement of tasks in a parallel application on specific nodes of a supercomputer can significantly impact performance. Traditionally, this task mapping has focused on reducing the distance between communicating tasks on the physical network. This minimizes the number of hops that point-to-point messages travel and thus reduces link sharing between messages and contention. However, for applications that use collectives over sub-communicators, this heuristic may not be optimal. Many collectives can benefit from an increase in bandwidth even at the cost of an increase in hop count, especially when sending large messages. For example, placing communicating tasks in a cube configuration rather than a plane or a line on a torus network increases the number of possible paths messages might take. This increases the available bandwidth which can lead to significant performance gains. We have developed Rubik, a tool that provides a simple and intuitive interface to create a wide variety of mappings for structured communication patterns. Rubik supports a number of elementary operations such as splits, tilts, or shifts, that can be combined into a large number of unique patterns. Each operation can be applied to disjoint groups of processes involved in collectives to increase the effective bandwidth. We demonstrate the use of Rubik for improving performance of two parallel codes, pF3D and Qbox, which use collectives over sub-communicators

    Communications systems technology assessment study. Volume 2: Results

    Get PDF
    The cost and technology characteristics are examined for providing special satellite services at UHF, 2.5 GHz, and 14/12 GHz. Considered are primarily health, educational, informational and emergency disaster type services. The total cost of each configuration including space segment, earth station, installation operation and maintenance was optimized to reduce the user's total annual cost and establish preferred equipment performance parameters. Technology expected to be available between now and 1985 is identified and comparisons made between selected alternatives. A key element of the study is a survey of earth station equipment updating past work in the field, providing new insight into technology, and evaluating production and test methods that can reduce costs in large production runs. Various satellite configurations were examined. The cost impact of rain attenuation at Ku-band was evaluated. The factors affecting the ultimate capacity achievable with the available orbital arc and available bandwidth were analyzed

    WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System

    Full text link
    Communication efficiency plays an important role in accelerating the distributed training of Deep Neural Networks (DNN). All-reduce is the key communication primitive to reduce model parameters in distributed DNN training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements for distributed training of large DNNs. One of the promising alternatives for electrical interconnect is optical interconnect, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing all-reduce operation in optical interconnect system, which can take advantage of WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. We further derive the minimum number of communication steps and communication time to realize the all-reduce using WRHT. Simulation results show that the communication time of WRHT is reduced by 75.59%, 49.25%, and 70.1% respectively compared with three traditional all-reduce algorithms simulated in optical interconnect system. Simulation results also show that WRHT can reduce the communication time for all-reduce operation by 86.69% and 84.71% in comparison with two existing all-reduce algorithms in electrical interconnect system.Comment: This paper is under the submission of GLOBECOM 202
    • ā€¦
    corecore