
    Learning from the Success of MPI

    The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other approaches, including automatic parallelization and directive-based parallelism, are easier to use. This paper argues that MPI has succeeded because it addresses all of the important issues in providing a parallel programming model. Comment: 12 pages, 1 figure.
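
    To make the message-passing model the paper defends concrete, here is a minimal MPI point-to-point example in C. It is an illustrative sketch, not code from the paper; it simply shows the send/receive style of programming the paper discusses.

        /* Minimal illustration of the message-passing model: rank 0 sends one
           integer to rank 1. Compile with an MPI wrapper (e.g. mpicc) and run
           with at least two processes (e.g. mpirun -np 2). */
        #include <mpi.h>
        #include <stdio.h>

        int main(int argc, char **argv)
        {
            int rank, value = 42;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            if (rank == 0) {
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank 1 received %d\n", value);
            }

            MPI_Finalize();
            return 0;
        }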

    Performance Analysis for Mesh and Mesh-Spectral Archetype Applications

    This document outlines a simple method for benchmarking a parallel communication library and for using the results to model the performance of applications developed with that communication library. We use compositional performance analysis - decomposing a parallel program into its modular parts and analyzing their respective performances - to gain perspective on the performance of the whole program. This model is useful for predicting parallel program execution times for different program archetypes (e.g., mesh and mesh-spectral) using communication libraries built with different message-passing schemes (e.g., Fortran M and Fortran with MPI) running on different architectures (e.g., IBM SP2 and a network of Pentium personal computers).
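
    A sketch of the kind of microbenchmark such a model is typically fitted to: an MPI ping-pong that measures one-way message time as a function of message size. The message sizes, repetition count, and the linear-cost interpretation are assumptions for illustration, not details taken from the document.

        /* Ping-pong microbenchmark between ranks 0 and 1; run with at least
           two processes. The measured one-way times can be fitted to a
           latency + size/bandwidth cost model. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            const int reps = 1000;
            int rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            for (int n = 8; n <= (1 << 19); n *= 4) {
                char *buf = malloc((size_t)n);
                MPI_Barrier(MPI_COMM_WORLD);
                double t0 = MPI_Wtime();
                for (int i = 0; i < reps; i++) {
                    if (rank == 0) {
                        MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                        MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                    } else if (rank == 1) {
                        MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
                        MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                    }
                }
                double t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time */
                if (rank == 0)
                    printf("n = %8d bytes   t = %.3f us\n", n, t * 1e6);
                free(buf);
            }

            MPI_Finalize();
            return 0;
        }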

    Investigation of hybrid message-passing and shared-memory architectures for parallel computers: a case study: TurboNet

    Several DSP (Digital Signal Processing) algorithms are developed for the MIT TurboNet parallel computer. In contrast to other parallel computers that implement exclusively in hardware either the message-passing or the shared-memory communication paradigm, or employ distributed shared-memory architectures characterized by inefficient implementation of the shared-memory paradigm, the hybrid architecture of TurboNet supports direct, efficient implementation of both paradigms. Where possible, three versions of each algorithm are developed, corresponding to message-passing, shared-memory, and hybrid communications, respectively. Theoretical and experimental comparisons of algorithms are employed in the analysis of performance. The results prove that the hybrid versions generally achieve better performance than the other two versions. The main conclusion of this research is that small-scale and medium-scale parallel computers should implement both communication paradigms directly in hardware, for high performance, robustness in relation to the application space, and ease of algorithm development. To facilitate theoretical comparisons, a methodology is developed for highly accurate prediction of algorithm performance. The success of this methodology proves that such prediction is possible for complex parallel computers, such as TurboNet, if enough information is provided by the data dependence graphs.
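
    An illustrative sketch, not TurboNet code, of how the two paradigms are commonly combined in a single program: OpenMP threads share memory within each process, while MPI passes messages between processes. The array size and the trivial per-element work are placeholders.

        /* Hybrid sketch: shared-memory parallelism inside each process via an
           OpenMP reduction, message passing between processes via MPI_Reduce.
           Compile with mpicc -fopenmp (the pragma is ignored otherwise). */
        #include <mpi.h>
        #include <stdio.h>

        #define N 1000000

        int main(int argc, char **argv)
        {
            int rank, provided;
            static double x[N];
            double local = 0.0, global = 0.0;

            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            /* shared-memory paradigm: threads of one process sum a shared array */
            #pragma omp parallel for reduction(+:local)
            for (int i = 0; i < N; i++) {
                x[i] = 1.0;              /* stand-in for real per-element work */
                local += x[i];
            }

            /* message-passing paradigm: combine the per-process partial sums */
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("global sum = %f\n", global);

            MPI_Finalize();
            return 0;
        }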

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered.
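
    A minimal sketch of the two mechanisms under study, expressed with standard MPI rather than MPI/Pro: persistent requests bind buffers, peers, and tags once (early binding), and the Start/compute/Wait pattern overlaps communication with independent computation. The ring exchange and the placeholder computation are assumptions for illustration.

        /* Early binding via persistent requests, plus overlap of communication
           and computation in each iteration of a ring exchange. */
        #include <mpi.h>
        #include <stdio.h>

        #define N 4096
        #define STEPS 100

        int main(int argc, char **argv)
        {
            int rank, size;
            double sendbuf[N], recvbuf[N], acc = 0.0;
            MPI_Request reqs[2];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            int right = (rank + 1) % size, left = (rank + size - 1) % size;

            for (int i = 0; i < N; i++)
                sendbuf[i] = (double)rank;

            /* early binding: set up the communication once, before it is used */
            MPI_Send_init(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Recv_init(recvbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[1]);

            for (int step = 0; step < STEPS; step++) {
                MPI_Startall(2, reqs);            /* kick off communication */
                for (int i = 0; i < N; i++)       /* overlap: independent work */
                    acc += (double)i * 1e-9;
                MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
            }

            MPI_Request_free(&reqs[0]);
            MPI_Request_free(&reqs[1]);
            if (rank == 0) printf("done, acc = %f\n", acc);
            MPI_Finalize();
            return 0;
        }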

    Parallel Computation in Econometrics: A Simplified Approach

    Parallel computation has a long history in econometric computing, but is not at all widespread. We believe that a major impediment is the labour cost of coding for parallel architectures. Moreover, programs for specific hardware often become obsolete quite quickly. Our approach is to take a popular matrix programming language (Ox) and implement a message-passing interface using MPI. Next, object-oriented programming allows us to hide the specific parallelization code, so that a program does not need to be rewritten when it is ported from the desktop to a distributed network of computers. Our focus is on so-called embarrassingly parallel computations, and we address the issue of parallel random number generation.
    Keywords: Code optimization; Econometrics; High-performance computing; Matrix-programming language; Monte Carlo; MPI; Ox; Parallel computing; Random number generation.
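
    A minimal sketch of the embarrassingly parallel pattern the paper targets, written in C with MPI rather than Ox: each rank runs independent Monte Carlo replications with its own random-number stream, and the results are combined with a single reduction. The per-rank srand() seeding is only a stand-in for the parallel random-number generation schemes the paper actually addresses.

        /* Embarrassingly parallel Monte Carlo (toy example: estimating pi). */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            const long reps_per_rank = 1000000;
            int rank, size;
            long hits = 0, total_hits = 0;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            srand(12345 + rank);   /* distinct seed per rank (illustrative only) */
            for (long i = 0; i < reps_per_rank; i++) {
                double x = rand() / (double)RAND_MAX;
                double y = rand() / (double)RAND_MAX;
                if (x * x + y * y <= 1.0) hits++;
            }

            MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0)
                printf("pi ~= %f\n",
                       4.0 * total_hits / (reps_per_rank * (double)size));

            MPI_Finalize();
            return 0;
        }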

    Parallel NPARC: Implementation and Performance

    Version 3 of the NPARC Navier-Stokes code includes support for large-grain (block level) parallelism using explicit message passing between a heterogeneous collection of computers. This capability has the potential for significant performance gains, depending upon the block data distribution. The parallel implementation uses a master/worker arrangement of processes. The master process assigns blocks to workers, controls worker actions, and provides remote file access for the workers. The processes communicate via explicit message passing using an interface library which provides portability to a number of message passing libraries, such as PVM (Parallel Virtual Machine). A Bourne shell script is used to simplify the task of selecting hosts, starting processes, retrieving remote files, and terminating a computation. This script also provides a simple form of fault tolerance. An analysis of the computational performance of NPARC is presented, using data sets from an F/A-18 inlet study and a Rocket Based Combined Cycle Engine analysis. Parallel speedup and overall computational efficiency were obtained for various NPARC run parameters on a cluster of IBM RS6000 workstations. The data show that although NPARC performance compares favorably with the estimated potential parallelism, typical data sets used with previous versions of NPARC will often need to be reblocked for optimum parallel performance. In one of the cases studied, reblocking increased peak parallel speedup from 3.2 to 11.8.
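
    A sketch of the master/worker arrangement described above, reduced to its scheduling skeleton and written with MPI: rank 0 hands out block indices on demand and workers report back when a block is finished. The block count and the placeholder "work" are assumptions; none of this is NPARC code.

        /* Master/worker block scheduling skeleton; run with two or more ranks. */
        #include <mpi.h>
        #include <stdio.h>

        #define NBLOCKS 16
        #define TAG_WORK 1
        #define TAG_DONE 2
        #define TAG_STOP 3

        int main(int argc, char **argv)
        {
            int rank, size;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            if (rank == 0) {                        /* master: assign blocks */
                int next = 0, active = size - 1, block;
                MPI_Status st;
                while (active > 0) {
                    MPI_Recv(&block, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DONE,
                             MPI_COMM_WORLD, &st);
                    if (next < NBLOCKS) {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                                 MPI_COMM_WORLD);
                        next++;
                    } else {
                        MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                                 MPI_COMM_WORLD);
                        active--;
                    }
                }
            } else {                                /* worker: request and process */
                int block = -1;
                MPI_Status st;
                MPI_Send(&block, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
                for (;;) {
                    MPI_Recv(&block, 1, MPI_INT, 0, MPI_ANY_TAG,
                             MPI_COMM_WORLD, &st);
                    if (st.MPI_TAG == TAG_STOP) break;
                    /* placeholder for solving one grid block */
                    printf("worker %d processing block %d\n", rank, block);
                    MPI_Send(&block, 1, MPI_INT, 0, TAG_DONE, MPI_COMM_WORLD);
                }
            }

            MPI_Finalize();
            return 0;
        }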

    Parallel Implementation of AES using XTS Mode of Operation

    Data encryption is essential for protecting data from unauthorized access. The Advanced Encryption Standard (AES), among many encryption algorithms, is the most popular algorithm currently employed to secure static and dynamic data. There are several modes of AES operation. Each of these modes defines a unique way to perform data encryption. XTS mode is the latest mode developed to protect data stored in hard-disk-like sector-based storage devices. A recent increase in the rate of data breaches has triggered the necessity to encrypt stored data as well. AES encryption, however, is a complex process. As it involves a lot of computations, encrypting huge amounts of data is undoubtedly computationally intensive. Parallel computers have been used mostly in high-performance computation research to solve computationally intensive problems. Parallel systems, configured as general-purpose multi-core systems, are currently gaining popularity even at the desktop level. Several programming models have been developed to assist the writing of parallel programs, and some have already been used to parallelize AES. As a result, AES data encryption has become more efficient and applicable. The message passing model is a popular parallel communication/synchronization model with an early origin. Message Passing Interface (MPI) is the first standardized, vendor-independent, message-passing library interface that establishes a portable, efficient, and flexible standard for message passing during computation. Therefore, this paper describes an implementation of AES using XTS mode in parallel via MPI.
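
    A sketch of the decomposition that makes XTS convenient to parallelize: each sector is encrypted independently (its tweak is derived from the sector number), so sectors can simply be divided among MPI ranks and the ciphertext gathered back. The encrypt_sector() routine below is a placeholder to keep the example self-contained; a real implementation would call an AES-XTS primitive from a cryptographic library, and the sector count is assumed to divide evenly among the ranks.

        /* Distribute sector-independent encryption work across MPI ranks. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define SECTOR_SIZE 512
        #define NSECTORS    1024   /* assumed divisible by the rank count */

        /* Placeholder for AES-XTS on one sector; NOT a real cipher. */
        static void encrypt_sector(unsigned char *sector, long sector_num)
        {
            for (int i = 0; i < SECTOR_SIZE; i++)
                sector[i] ^= (unsigned char)(sector_num + i);
        }

        int main(int argc, char **argv)
        {
            int rank, size;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            long per_rank = NSECTORS / size;
            unsigned char *local = malloc((size_t)per_rank * SECTOR_SIZE);
            memset(local, 0, (size_t)per_rank * SECTOR_SIZE);  /* stand-in plaintext */

            /* each rank encrypts its own contiguous range of sectors */
            for (long s = 0; s < per_rank; s++)
                encrypt_sector(local + s * SECTOR_SIZE, rank * per_rank + s);

            /* collect the ciphertext on rank 0 */
            unsigned char *all = NULL;
            if (rank == 0)
                all = malloc((size_t)NSECTORS * SECTOR_SIZE);
            MPI_Gather(local, (int)(per_rank * SECTOR_SIZE), MPI_UNSIGNED_CHAR,
                       all, (int)(per_rank * SECTOR_SIZE), MPI_UNSIGNED_CHAR,
                       0, MPI_COMM_WORLD);

            if (rank == 0)
                printf("encrypted %d sectors on %d ranks\n", NSECTORS, size);

            free(local);
            free(all);
            MPI_Finalize();
            return 0;
        }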

    Scalable group-based checkpoint/restart for large-scale message-passing systems

    The ever-increasing number of processors used in parallel computers is making fault tolerance support in large-scale parallel systems more and more important. We discuss the inadequacies of existing system-level checkpointing solutions for message-passing applications as the system scales up. We analyze the coordination cost and blocking behavior of two current MPI implementations with checkpointing support. A group-based solution combining coordinated checkpointing and message logging is then proposed. Experimental results demonstrate better performance and scalability than LAM/MPI and MPICH-VCL. To assist group formation, a method to analyze the communication behaviors of the application is proposed. ©2008 IEEE.
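
    A highly simplified sketch of the group-based idea, assuming the details rather than reproducing the paper's protocol: ranks are split into fixed-size groups with MPI_Comm_split, checkpoints are coordinated only within a group, and messages that cross a group boundary are additionally logged so they could be replayed on restart. Group size, file layout, and the logged content are all placeholders.

        /* Group-based checkpointing skeleton: intra-group coordination plus
           logging of cross-group messages. */
        #include <mpi.h>
        #include <stdio.h>
        #include <string.h>

        #define GROUP_SIZE 4

        static MPI_Comm group_comm;
        static int world_rank;
        static FILE *msg_log;

        /* Log the payload only when the destination lies outside this rank's group. */
        static void send_logged(const void *buf, int count, int dest, int tag)
        {
            if (dest / GROUP_SIZE != world_rank / GROUP_SIZE)
                fwrite(buf, 1, (size_t)count, msg_log);
            MPI_Send(buf, count, MPI_BYTE, dest, tag, MPI_COMM_WORLD);
        }

        /* Coordinated checkpoint, but only among the ranks of one group. */
        static void checkpoint(const void *state, int bytes)
        {
            char name[64];
            MPI_Barrier(group_comm);           /* coordinate within the group only */
            snprintf(name, sizeof name, "ckpt.%d", world_rank);
            FILE *f = fopen(name, "wb");
            fwrite(state, 1, (size_t)bytes, f);
            fclose(f);
            MPI_Barrier(group_comm);
        }

        int main(int argc, char **argv)
        {
            int world_size;
            char logname[64], token = 'x';
            double state[128];

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
            MPI_Comm_size(MPI_COMM_WORLD, &world_size);
            MPI_Comm_split(MPI_COMM_WORLD, world_rank / GROUP_SIZE, world_rank,
                           &group_comm);

            snprintf(logname, sizeof logname, "msglog.%d", world_rank);
            msg_log = fopen(logname, "wb");
            memset(state, 0, sizeof state);

            /* one cross-group message, so the log actually records something */
            if (world_size > GROUP_SIZE) {
                if (world_rank == 0)
                    send_logged(&token, 1, GROUP_SIZE, 0);
                else if (world_rank == GROUP_SIZE)
                    MPI_Recv(&token, 1, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                             MPI_STATUS_IGNORE);
            }

            checkpoint(state, (int)sizeof state);

            fclose(msg_log);
            MPI_Comm_free(&group_comm);
            MPI_Finalize();
            return 0;
        }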