38 research outputs found

    Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

    Get PDF

    On the conditions for efficient interoperability with threads: An experience with PGAS languages using Cray communication domains

    Get PDF
    Today's high performance systems are typically built from shared memory nodes connected by a high speed network. That architecture, combined with the trend towards less memory per core, encourages programmers to use a mixture of message passing and multithreaded programming. Unfortunately, the advantages of using threads for in-node programming are hindered by their inability to efficiently communicate between nodes. In this work, we identify some of the performance problems that arise in such hybrid programming environments and characterize conditions needed to achieve high communication performance for multiple threads: addressability of targets, separability of communication paths, and full direct reachability to targets. Using the GASNet communication layer on the Cray XC30 as our experimental platform, we show how to satisfy these conditions. We also discuss how satisfying these conditions is influenced by the communication abstraction, implementation constraints, and the interconnect messaging capabilities. To evaluate these ideas, we compare the communication performance of a thread-based node runtime to a process-based runtime. Without our GASNet extensions, thread communication is significantly slower than processes - up to 21x slower. Once the implementation is modified to address each of our conditions, the two runtimes have comparable communication performance. This allows programmers to more easily mix models like OpenMP, CILK, or pthreads with a GASNet-based model like UPC, with the associated performance, convenience and interoperability advantages that come from using threads within a node. © 2014 ACM

    Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning

    Get PDF
    In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency. Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications. Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication over Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate the aforementioned limitations during the training phase in distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency. Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging the User-Level Failure Mitigation (ULFM) from Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments

    ParBiBit: Parallel tool for binary biclustering on modern distributed-memory systems

    Get PDF
    [Abstract]: Biclustering techniques are gaining attention in the analysis of large-scale datasets as they identify two-dimensional submatrices where both rows and columns are correlated. In this work we present ParBiBit, a parallel tool to accelerate the search of interesting biclusters on binary datasets, which are very popular on different fields such as genetics, marketing or text mining. It is based on the state-of-the-art sequential Java tool BiBit, which has been proved accurate by several studies, especially on scenarios that result on many large biclusters. ParBiBit uses the same methodology as BiBit (grouping the binary information into patterns) and provides the same results. Nevertheless, our tool significantly improves performance thanks to an efficient implementation based on C++11 that includes support for threads and MPI processes in order to exploit the compute capabilities of modern distributed-memory systems, which provide several multicore CPU nodes interconnected through a network. Our performance evaluation with 18 representative input datasets on two different eight-node systems shows that our tool is significantly faster than the original BiBit. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/parbibit/.This work was supported by the Ministry of Economy, Industry and Competitiveness of Spain and FEDER funds of the European Union [grant TIN2016-75845-P (AEI/FEDER/UE)], as well as by Xunta de Galicia (Centro Singular de Investigacion de Galicia accreditation 2016-2019, ref. EDG431G/01).Xunta de Galicia; EDG431G/0

    Runtime support for irregular computation in MPI-based applications

    Get PDF
    In recent years there are increasing number of applications that have been using irregular computation models in various domains, such as computational chemistry, bioinformatics, nuclear reactor simulation and social network analysis. Due to the irregular and data-dependent communication patterns and sparse data structures involved in those applications, the traditional parallel programming model and runtime need to be carefully designed and implemented in order to accommodate the performance and scalability requirements of those irregular applications on large-scale systems. The Message Passing Interface (MPI) is the industry standard communication library for high performance computing. However, whether MPI can serve as a suitable programming model / runtime for irregular applications or not is one of the most debated aspects in the community. The goal of this thesis is to investigate the suitability of MPI to irregular applications. This thesis consists of two subtopics. The first subtopic focuses on improving MPI runtime to support the irregular applications from perspective of scalability and performance. The first three parts in this subtopic focus on MPI one-sided communication. In the first part, we present a thorough survey of current MPI one-sided implementations and illustrate scalability limitations in those implementations. In the second part, we propose a new design and implementation of MPI one-sided communication, called ScalaRMA, to effectively address those scalability limitations. The third part in this subtopic focuses on various issuing strategies in MPI one-sided communication. We propose an adaptive issuing strategy which can adaptively choose between delayed issuing strategy and eager issuing strategy in MPI runtime to achieve high performance based on current communication volume in MPI-based application. The last part in this subtopic is to tackle the scalability limitations in the virtual connection (VC) objects in MPI implementation. We propose a scalable design to reduce the memory consumption of VC objects in MPI runtime. The second subtopic of this thesis focuses on improving MPI programming model to better support the irregular applications. Traditional two-sided data movement model in MPI standard designed for scientific computation provides a paradigm for user to specify how to move the data between processes, however, it does not provide interface to flexibly manage the computation, which means user needs to explicitly manage where the computation should be performed. This model is not well suited for irregular applications which involve irregular and data-dependent communication pattern. In this work, we combine Active Messages (AM), an alternative programming paradigm which is more suitable for irregular computations, with traditional MPI data movement model, and propose a generalized MPI-interoperable Active Messages framework (MPI-AM). The framework allows MPI-based applications to incrementally use AMs only when necessary, avoiding rewriting the entire MPI-based application. Such framework integrates data movement and computation together in the programming model and MPI can coordinate the computation and communication in a much more flexible manner. In this subtopic, we propose several strategies including message streaming, buffer management and asynchronous processing, in order to efficiently handle AMs inside MPI. We also propose subtle correctness semantics of MPI-AM to define how AMs can work correctly with other MPI messages in the system, from perspectives of memory consistency, concurrency, ordering and atomicity

    Optimizing MPI one-sided synchronization mechanisms on Cray's Cascade HPC systems

    Get PDF
    In this work we proposed Notified Access a new communication model that targets RDMA networks. Our focus was on optimizing producer-consumer computations, avoiding to over synchronize processes in point-to-point communications when it's not needed. We proposed a communication model in which a notification can be coupled with a single Remote Memory Access (RMA). In our model the target of an RMA operation is directly notified after the completion of a notified operation. This approach, avoiding the use of other synchronization primitives, minimizes synchronization latencies while using full hardware offload typical of high-performance networks. In order to demonstrate lower overheads than other point-to-point synchronization mechanisms, we implemented it in an open source MPI-3 library. We evaluated the performances of our implementation in a ping-pong benchmark, a computation/communication overlap benchmark and in three real-world applications: a pipeline stencil, a tree-based reduce and a task based Cholesky factorization. Our analysis shows that Notified Access is a valuable primitive for any RMA system and furthermore we show that the required hardware feature are already available in multiple state-of-the-art high-performance networks

    Improving MPI Threading Support for Current Hardware Architectures

    Get PDF
    Threading support for Message Passing Interface (MPI) has been defined in the MPI standard for more than twenty years. While many standard-compliance MPI implementations fully support multithreading, the threading support in MPI still cannot provide the optimal performance on the same level as the non-threading environment. The performance disparity leads to low adoption rate from applications, and eventually, lesser interest in optimizing MPI threading support. However, with the current advancement in computation hardware, the number of CPU core per packet is growing drastically. Using shared-memory MPI communication has become more costly. MPI threading without local communication is one of the alternatives and the some interests are shifting back toward threading to MPI.In this work, we investigate different approaches to leverage the power of thread parallelism and tools to help us to raise the multi-threaded MPI performance to reasonable level. We propose a novel multi-threaded MPI benchmark with multiple communication patterns to stress multiple points of the MPI implementation, with the ability to switch between using MPI process and threads for quick comparison between two modes. Enabling the us, and the others MPI developers to stress test their implementation design.We address the interoperability between MPI implementation and threading frameworks by introducing the thread-synchronization object, an object that gives the MPI implementation more control over user-level thread, allowing for more thread utilization in MPI. In our implementation, the synchronization object relieves the lock contention on the internal progress engine and able to achieve up to 7x the performance of the original implementation. Moving forward, we explore the possibility of harnessing the true thread concurrency. We proposed several strategies to address the bottlenecks in MPI implementation. From our evaluation, with our novel threading optimization, we can achieve up to 22x the performance comparing to the legacy MPI designs

    Scalability Engineering for Parallel Programs Using Empirical Performance Models

    Get PDF
    Performance engineering is a fundamental task in high-performance computing (HPC). By definition, HPC applications should strive for maximum performance. As HPC systems grow larger and more complex, the scalability of an application has become of primary concern. Scalability is the ability of an application to show satisfactory performance even when the number of processors or the problems size is increased. Although various analysis techniques for scalability were suggested in past, engineering applications for extreme-scale systems still occurs ad hoc. The challenge is to provide techniques that explicitly target scalability throughout the whole development cycle, thereby allowing developers to uncover bottlenecks earlier in the development process. In this work, we develop a number of fundamental approaches in which we use empirical performance models to gain insights into the code behavior at higher scales. In the first contribution, we propose a new software engineering approach for extreme-scale systems. Specifically, we develop a framework that validates asymptotic scalability expectations of programs against their actual behavior. The most important applications of this method, which is especially well suited for libraries encapsulating well-studied algorithms, include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. We supply a tool-chain that automates large parts of the framework, thus allowing it to be continuously applied throughout the development cycle with very little effort. We evaluate the framework with MPI collective operations, a data-mining code, and various OpenMP constructs. In addition to revealing unexpected scalability bottlenecks, the results also show that it is a viable approach for systematic validation of performance expectations. As the second contribution, we show how the isoefficiency function of a task-based program can be determined empirically and used in practice to control the efficiency. Isoefficiency, a concept borrowed from theoretical algorithm analysis, binds efficiency, core count, and the input size in one analytical expression, thereby allowing the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we analyze resource contention by modeling the efficiency of contention-free execution. This allows poor scaling to be attributed either to excessive resource contention overhead or structural conflicts related to task dependencies or scheduling. Our results, obtained with applications from two benchmark suites, demonstrate that our approach provides insights into fundamental scalability limitations or excessive resource overhead and can help answer critical co-design questions. Our contributions for better scalability engineering can be used not only in the traditional software development cycle, but also in other, related fields, such as algorithm engineering. It is a field that uses the software engineering cycle to produce algorithms that can be utilized in applications more easily. Using our contributions, algorithm engineers can make informed design decisions, get better insights, and save experimentation time
    corecore