4,551 research outputs found

    Optimizing MPI one-sided synchronization mechanisms on Cray's Cascade HPC systems

    In this work we proposed Notified Access, a new communication model that targets RDMA networks. Our focus was on optimizing producer-consumer computations by avoiding unnecessary synchronization between processes in point-to-point communication. We proposed a communication model in which a notification can be coupled with a single Remote Memory Access (RMA) operation. In our model, the target of an RMA operation is notified directly after the completion of a notified operation. By avoiding other synchronization primitives, this approach minimizes synchronization latency while using the full hardware offload typical of high-performance networks. To demonstrate lower overheads than other point-to-point synchronization mechanisms, we implemented it in an open-source MPI-3 library. We evaluated the performance of our implementation with a ping-pong benchmark, a computation/communication overlap benchmark, and three real-world applications: a pipeline stencil, a tree-based reduce, and a task-based Cholesky factorization. Our analysis shows that Notified Access is a valuable primitive for any RMA system; furthermore, we show that the required hardware features are already available in multiple state-of-the-art high-performance networks.

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. 
This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered.

    Using Genetic Algorithms for Building Metrics of Collaborative Systems

    The objective of this paper is to show the importance of genetic algorithms in building robust metrics for collaborative systems. The main types of collaborative systems in the economy are presented and some characteristics of genetic algorithms are described. A genetic algorithm was implemented to determine the local maximum and minimum points of the relative complexity function associated with a collaborative banking system. Intelligent collaborative systems based on genetic algorithms, which represent the new generation of collaborative systems, are analyzed, and the implementation of auto-adaptive interfaces in a banking application is described. Keywords: Collaborative Systems, Genetic Algorithms, Metrics, Banking, Auto-Adaptive Interfaces
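
The search for extrema described in this abstract can be illustrated with a minimal genetic algorithm sketch. Everything below is an assumption made for illustration: the abstract does not give the relative complexity function, so `relative_complexity` is a hypothetical stand-in, and the operators (truncation selection, arithmetic crossover, Gaussian mutation) are common textbook choices, not necessarily those used in the paper.

```python
import random


def relative_complexity(x):
    # Hypothetical stand-in objective (NOT the paper's function):
    # a parabola with a single maximum at x = 5 on the interval [0, 10].
    return x * (10.0 - x)


def genetic_maximize(fitness, low, high, pop_size=30, generations=50, seed=0):
    """Return the fittest individual found on [low, high]."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    population = [rng.uniform(low, high) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half unchanged (elitism).
        population.sort(key=fitness, reverse=True)
        elite = population[: pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = (a + b) / 2.0                       # arithmetic crossover
            child += rng.gauss(0.0, 0.05 * (high - low))  # Gaussian mutation
            children.append(min(high, max(low, child)))   # clamp to the interval
        population = elite + children
    return max(population, key=fitness)


best_max = genetic_maximize(relative_complexity, 0.0, 10.0)
# A minimum is found by maximizing the negated function.
best_min = genetic_maximize(lambda x: -relative_complexity(x), 0.0, 10.0)
print(best_max, best_min)
```

Because the fitter half of each generation survives unchanged, the best fitness never decreases from one generation to the next; whether the search finds a global or only a local extremum depends on the function and the mutation scale.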

    Complete instrumentation requirements for performance analysis of web based technologies

    In this paper we present the eDragon environment, a research platform created to perform complete performance analysis of new Web-based technologies. eDragon enables the understanding of how application servers work on both sequential and parallel platforms, offering new insight into the usage of system resources. The environment is composed of a set of instrumentation modules, a performance analysis and visualization tool, and a set of experimental methodologies to perform complete performance analysis of Web-based technologies. This paper describes the design and implementation of this research platform and highlights some of its main functionalities. We also show how a detailed analytical view can be obtained through the application of a bottom-up strategy, starting with a group of system events and advancing to more complex performance metrics using a continuous derivation process. We acknowledge the European Center for Parallelism of Barcelona (CEPBA) and CEPBA-IBM Research Institute (CIRI) for supplying the computing resources for our experiments. This work is supported by the Ministry of Science and Technology of Spain and the European Union (FEDER funds) under contract TIC2001–0995-C02–0 I and by Direcció General de Recerca of the Generalitat de Catalunya under grant 2001FI 00694 UPC APTIND. Peer Reviewed. Postprint (author's final draft)

    Toward Message Passing Failure Management

    As machine sizes have increased and application runtimes have lengthened, research into fault tolerance has evolved alongside them. Moving from result checking, to rollback recovery, and to algorithm-based fault tolerance, the type of recovery being performed has changed, but the programming model in which it executes has remained virtually static since the publication of the original Message Passing Interface (MPI) Standard in 1992. Since that time, applications have used a message passing paradigm to communicate between processes, but they could not perform process recovery within an MPI implementation due to limitations of the MPI Standard. This dissertation describes a new protocol using the existing MPI Standard, called Checkpoint-on-Failure, to perform limited fault tolerance within the current framework of MPI, and proposes a new platform titled User Level Failure Mitigation (ULFM) to build more complete and complex fault tolerance solutions with a true fault-tolerant MPI implementation. We demonstrate the overhead involved in using these fault-tolerant solutions and give examples of applications and libraries that construct other fault tolerance mechanisms based on the constructs provided in ULFM.

    Partial aggregation for collective communication in distributed memory machines

    High Performance Computing (HPC) systems interconnect a large number of Processing Elements (PEs) in high-bandwidth networks to simulate complex scientific problems. The increasing scale of HPC systems poses great challenges for algorithm designers. As the average distance between PEs increases, data movement across hierarchical memory subsystems introduces high latency. Minimizing latency is particularly challenging in collective communications, where many PEs may interact in complex communication patterns. Although collective communications can be optimized for network-level parallelism, occasional synchronization delays due to dependencies in the communication pattern degrade application performance. To reduce the performance impact of communication and synchronization costs, parallel algorithms are designed with sophisticated latency hiding techniques. The principle is to interleave computation with asynchronous communication, which increases the overall occupancy of compute cores. However, collective communication primitives abstract away parallelism, which limits the integration of latency hiding techniques. Approaches to work around these limitations either modify the algorithmic structure of application codes or replace collective primitives with verbose low-level communication calls. While these approaches give fine-grained control for latency hiding, implementing collective communication algorithms is challenging and requires expert knowledge of HPC network topologies. A collective communication pattern is commonly described as a Directed Acyclic Graph (DAG) where a set of PEs, represented as vertices, resolve data dependencies through communication along the edges. Our approach improves latency hiding in collective communication through partial aggregation. Based on mathematical rules of binary operations and homomorphism, we expose data parallelism in the respective DAG to overlap computation with communication. 
The proposed concepts are implemented and evaluated with a subset of collective primitives in the Message Passing Interface (MPI), an established communication standard in scientific computing. An experimental analysis with communication-bound microbenchmarks shows considerable performance benefits for the evaluated collective primitives. A detailed case study with a large-scale distributed sort algorithm demonstrates how partial aggregation significantly improves performance in data-intensive scenarios. Besides better latency hiding capabilities with collective communication primitives, our approach enables further optimizations of their implementations within MPI libraries. The large number of asynchronous programming models that are actively studied in the HPC community benefit from partial aggregation in collective communication patterns. Future work can utilize partial aggregation to improve the interaction of MPI collectives with accelerator architectures and to design more efficient communication algorithms.
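
The idea that algebraic properties of a binary operation permit aggregation on partial data can be sketched as follows. This is a single-process illustration, not the dissertation's MPI implementation: it assumes only that the operation is associative and that every chunk is non-empty, and the function names `full_reduce` and `partial_aggregate` are hypothetical.

```python
from functools import reduce
import operator


def full_reduce(chunks, op):
    """Baseline: wait until every chunk is available, then reduce once."""
    flat = [x for chunk in chunks for x in chunk]
    return reduce(op, flat)


def partial_aggregate(chunk_stream, op):
    """Fold each chunk into a running result as soon as it arrives.

    In a distributed setting the reduction of one chunk could overlap
    with the transfer of the next; here arrival is just iteration order.
    """
    partial = None
    for chunk in chunk_stream:
        local = reduce(op, chunk)  # aggregate the chunk on its own
        partial = local if partial is None else op(partial, local)
    return partial


numbers = [[1, 2, 3], [4, 5], [6]]
words = [["a", "b"], ["c"], ["d", "e"]]

sum_full = full_reduce(numbers, operator.add)
sum_partial = partial_aggregate(iter(numbers), operator.add)
max_partial = partial_aggregate(iter(numbers), max)
# String concatenation is associative but not commutative, so the
# chunk order must be (and is) preserved.
concat_partial = partial_aggregate(iter(words), operator.add)
print(sum_full, sum_partial, max_partial, concat_partial)
```

Associativity is what makes the regrouping safe: reducing each chunk first and then combining the chunk results gives the same value as one reduction over all elements in the same order.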

    Load Balancing Algorithms for Parallel Spatial Join on HPC Platforms

    Geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of geospatial computations and analytics on large-scale datasets, parallel processing is necessary. To exploit fine-grained parallel processing on large-scale compute clusters, skewed datasets must be partitioned in a load-balanced way, which is challenging. The workload in spatial join is data dependent and highly irregular. Moreover, wide variation in the size and density of geometries from one region of the map to another further exacerbates the load imbalance. This dissertation focuses on the spatial join operation used in Geographic Information Systems (GIS) and spatial databases, where the inputs are two layers of geospatial data and the output is a combination of the two layers according to a join predicate. This dissertation introduces a novel spatial data partitioning algorithm geared towards load balancing parallel spatial join processing. Unlike existing partitioning techniques, the proposed algorithm divides the spatial join workload instead of partitioning the individual datasets separately, which provides better load balancing. This workload partitioning algorithm has been evaluated on a high-performance computing system using real-world datasets. An intermediate output-sensitive duplication avoidance technique is proposed that decreases the external memory space required for storing spatial join candidates across the partitions. GPU acceleration is used to further reduce the spatial partitioning runtime. For dynamic load balancing in spatial join, a novel framework for fine-grained work stealing is presented. This framework is efficient and NUMA-aware. Performance improvements are demonstrated on shared and distributed memory architectures using threads and message passing. Experimental results show effective mitigation of data skew. 
The framework supports a variety of spatial join predicates and spatial overlay using partitioned and un-partitioned datasets.
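
The work-stealing idea mentioned in this abstract can be sketched with a small single-threaded simulation. This is only an illustration under stated assumptions, not the dissertation's framework: workers advance in lock-step rounds, an idle worker steals from the currently longest queue, and the function name `run_with_work_stealing` is hypothetical. NUMA awareness, real threads, and message passing are all left out.

```python
from collections import deque


def run_with_work_stealing(task_lists, cost):
    """Simulate workers in lock-step rounds; return the cost handled per worker."""
    queues = [deque(tasks) for tasks in task_lists]
    done = [0] * len(queues)
    while any(queues):  # an empty deque is falsy
        for worker, queue in enumerate(queues):
            if queue:
                task = queue.pop()  # owner takes from the tail of its own queue
            else:
                victim = max(queues, key=len)  # most loaded queue
                if not victim:
                    continue  # nothing left to steal in this round
                task = victim.popleft()  # thief takes from the victim's head
            done[worker] += cost(task)
    return done


# Skewed input: all eight unit-cost tasks start on worker 0.
skewed = [[1] * 8, [], [], []]
balanced_load = run_with_work_stealing(skewed, cost=lambda t: t)
print(balanced_load)
```

Taking from opposite ends of the queue is the usual convention in work-stealing schedulers: the owner and the thieves rarely contend for the same task.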