2,521 research outputs found

    Performance evaluation of an open distributed platform for realistic traffic generation

    Get PDF
    Network researchers have dedicated a notable part of their efforts to the area of modeling traffic and to the implementation of efficient traffic generators. We feel that there is a strong demand for traffic generators capable to reproduce realistic traffic patterns according to theoretical models and at the same time with high performance. This work presents an open distributed platform for traffic generation that we called distributed internet traffic generator (D-ITG), capable of producing traffic (network, transport and application layer) at packet level and of accurately replicating appropriate stochastic processes for both inter departure time (IDT) and packet size (PS) random variables. We implemented two different versions of our distributed generator. In the first one, a log server is in charge of recording the information transmitted by senders and receivers and these communications are based either on TCP or UDP. In the other one, senders and receivers make use of the MPI library. In this work a complete performance comparison among the centralized version and the two distributed versions of D-ITG is presented

    Accelerating MPI collective communications through hierarchical algorithms with flexible inter-node communication and imbalance awareness

    Get PDF
    This work presents and evaluates algorithms for MPI collective communication operations on high performance systems. Collective communication algorithms are extensively investigated, and a universal algorithm to improve the performance of MPI collective operations on hierarchical clusters is introduced. This algorithm exploits shared-memory buffers for efficient intra-node communication while still allowing the use of unmodified, hierarchy-unaware traditional collectives for inter-node communication. The universal algorithm shows impressive performance results with a variety of collectives, improving upon the MPICH algorithms as well as the Cray MPT algorithms. Speedups average 15x - 30x for most collectives with improved scalability up to 65536 cores.^ Further novel improvements are also proposed for inter-node communication. By utilizing algorithms which take advantage of multiple senders from the same shared memory buffer, an additional speedup of 2.5x can be achieved. The discussion also evaluates special-purpose extensions to improve intra-node communication. These extensions return a shared memory or copy-on-write protected buffer from the collective, which reduces or completely eliminates the second phase of intra-node communication.^ The second part of this work improves the performance of MPI collective communication operations in the presence of imbalanced processes arrival times. High performance collective communications are crucial for the performance and scalability of applications, and imbalanced process arrival times are common in these applications. A micro-benchmark is used to investigate the nature of process imbalance with perfectly balanced workloads, and understand the nature of inter- versus intra-node imbalance. These insights are then used to develop imbalance tolerant reduction, broadcast, and alltoall algorithms, which minimize the synchronization delay observed by early arriving processes. These algorithms have been implemented and tested on a Cray XE6 using up to 32k cores with varying buffer sizes and levels of imbalance. Results show speedups over MPICH averaging 18.9x for reduce, 5.3x for broadcast, and 6.9x for alltoall in the presence of high, but not unreasonable, imbalance

    Design and Implementation of MPICH2 over InfiniBand with RDMA Support

    Full text link
    For several years, MPI has been the de facto standard for writing parallel applications. One of the most popular MPI implementations is MPICH. Its successor, MPICH2, features a completely new design that provides more performance and flexibility. To ensure portability, it has a hierarchical structure based on which porting can be done at different levels. In this paper, we present our experiences designing and implementing MPICH2 over InfiniBand. Because of its high performance and open standard, InfiniBand is gaining popularity in the area of high-performance computing. Our study focuses on optimizing the performance of MPI-1 functions in MPICH2. One of our objectives is to exploit Remote Direct Memory Access (RDMA) in Infiniband to achieve high performance. We have based our design on the RDMA Channel interface provided by MPICH2, which encapsulates architecture-dependent communication functionalities into a very small set of functions. Starting with a basic design, we apply different optimizations and also propose a zero-copy-based design. We characterize the impact of our optimizations and designs using microbenchmarks. We have also performed an application-level evaluation using the NAS Parallel Benchmarks. Our optimized MPICH2 implementation achieves 7.6 μ\mus latency and 857 MB/s bandwidth, which are close to the raw performance of the underlying InfiniBand layer. Our study shows that the RDMA Channel interface in MPICH2 provides a simple, yet powerful, abstraction that enables implementations with high performance by exploiting RDMA operations in InfiniBand. To the best of our knowledge, this is the first high-performance design and implementation of MPICH2 on InfiniBand using RDMA support.Comment: 12 pages, 17 figure

    Partial aggregation for collective communication in distributed memory machines

    Get PDF
    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges on algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract parallelism which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes, or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expertise knowledge about HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in a respective DAG to overlap computation with communication. The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates, how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The vast amount of asynchronous programming models, which are actively studied in the HPC community, benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with acclerator architectures, and to design more efficient communication algorithms

    Improving the performance of parallel scientific applications using cache injection

    Get PDF
    Cache injection is a viable technique to improve the performance of data-intensive parallel applications. This dissertation characterizes cache injection of incoming network data in terms of parallel application performance. My results show that the benefit of this technique is dependent on: the ratio of processor speed to memory speed, the cache injection policy, and the application\u27s communication characteristics. Cache injection addresses the memory wall for I/O by writing data into a processor\u27s cache directly from the I/O bus. This technique, unlike data prefetching, reduces the number of reads served by the memory unit. This reduction is significant for data-intensive applications whose performance is dominated by compulsory cache misses and cannot be alleviated by traditional caching systems. Unlike previous work on cache injection which focused on reducing host network stack overhead incurred by memory copies, I show that applications can directly benefit from this technique based on their temporal and spatial locality in accessing incoming network data. I also show that the performance of cache injection is directly proportional to the ratio of processor speed to memory speed. In other words, systems with a memory wall can provide significantly better performance with cache injection and an appropriate injection policy. This result implies that multi-core and many-core architectures would benefit from this technique. Finally, my results show that the application\u27s communication characteristics are key to cache injection performance. For example, cache injection can improve the performance of certain collective communication operations by up to 20% as a function of message size

    Evaluating the performance of the allreduce collective operation on clusters. Approach and results

    Get PDF
    The performance of the collective operations provided by a communication library is important for many applications run on clusters. The communication structure of collective operations can be organized as a tree. Performance can be improved by configuring and mapping the tree to the clusters in use. We describe and demonstrate an approach for evaluating the performance of different configurations and mappings of allreduce run on clusters of different size, consisting of single-CPU hosts, and SMPs with a different number of CPUs. A breakdown of the cost of allreduce using the best configuration on different clusters is provided. For all, the broadcast part is more expensive than the reduce part. Inter-host communication contributes more to the time per allreduce than the synchronization in the allreduce components. For the small messages sizes used (4 and 256 bytes), the time spent computing the partial reductions is insignificant. Reconfiguring hierarchy aware trees improved performance up to a factor of 1.49, by avoiding scalability problems of the components on SMPs, and by finding the right balance between available concurrency, load on 'root' hosts and the number of network links in a tree. Extending a tree by adding more threads, or by combining two trees does not have a negative influence on the performance of a configuration, but increasing message size does

    Runtime MPI Correctness Checking with a Scalable Tools Infrastructure

    Get PDF
    Increasing computational demand of simulations motivates the use of parallel computing systems. At the same time, this parallelism poses challenges to application developers. The Message Passing Interface (MPI) is a de-facto standard for distributed memory programming in high performance computing. However, its use also enables complex parallel programing errors such as races, communication errors, and deadlocks. Automatic tools can assist application developers in the detection and removal of such errors. This thesis considers tools that detect such errors during an application run and advances them towards a combination of both precise checks (neither false positives nor false negatives) and scalability. This includes novel hierarchical checks that provide scalability, as well as a formal basis for a distributed deadlock detection approach. At the same time, the development of parallel runtime tools is challenging and time consuming, especially if scalability and portability are key design goals. Current tool development projects often create similar tool components, while component reuse remains low. To provide a perspective towards more efficient tool development, which simplifies scalable implementations, component reuse, and tool integration, this thesis proposes an abstraction for a parallel tools infrastructure along with a prototype implementation. This abstraction overcomes the use of multiple interfaces for different types of tool functionality, which limit flexible component reuse. Thus, this thesis advances runtime error detection tools and uses their redesign and their increased scalability requirements to apply and evaluate a novel tool infrastructure abstraction. The new abstraction ultimately allows developers to focus on their tool functionality, rather than on developing or integrating common tool components. The use of such an abstraction in wide ranges of parallel runtime tool development projects could greatly increase component reuse. Thus, decreasing tool development time and cost. An application study with up to 16,384 application processes demonstrates the applicability of both the proposed runtime correctness concepts and of the proposed tools infrastructure
    • …
    corecore