412 research outputs found

    Software Tool Evaluation Methodology

    Get PDF
    The recent development of parallel and distributed computing software has introduced a variety of software tools that support several programming paradigms and languages. This variety of tools makes the selection of the best tool to run a given class of applications on a parallel or distributed system a non-trivial task that requires some investigation. We expect tool evaluation to receive more attention as the deployment and usage of distributed systems increases. In this paper, we present a multi-level evaluation methodology for parallel/distributed tools in which tools are evaluated from different perspectives. We apply our evaluation methodology to three message passing tools viz Express, p4, and PVM. The approach covers several important distributed systems platforms consisting of different computers (e.g., IBM-SP1, Alpha cluster, SUN workstations) interconnected by different types of networks (e.g., Ethernet, FDDI, ATM)

    A Problem Solving Environment for Network Computing

    Get PDF
    The current advances in high-speed networks and WWW technologies have made network computing a cost-effective high performance computing environment. New software development models and problem solving environments must be developed to utilize the network computing environment efficiently. In this paper we present Virtual Distributed Computing Environment (VDCE), which provides a problem solving environment for high-performance distributed computing over wide-area networks. VDCE enables scientists to develop distributed applications without knowing the detailed architecture of the underlying resources. VDCE provides well-defined library functions that relieve end users from tedious task implementations and it supports software reusability. The VDCE software architecture consists of two modules: Application Editor, and VDCE Runtime System. Application Editor is a Web-based graphical user interface that helps user to develop network applications and specifies the computing and communication properties of each task within the applications. The VDCE Runtime System schedules the individual tasks of the application to the best available resources, runs, and manages the application execution on the assigned resources. We also present how VDCE can be used as a problem solving environment and how the users can experiment and evaluate the performance of their applications for different VDCE hardware and/or software configurations

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    Get PDF
    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered

    Quantifying the Performance Differences Between PVM and TreadMarks

    Get PDF
    We compare two systems for parallel programming on networks of workstations: Parallel Virtual Machine (PVM) a message passing system, and TreadMarks, a software distributed shared memory (DSM) system. We present results for eight applications that were implemented using both systems. The programs are Water and Barnes-Hut from the SPLASH benchmark suite; 3-D FFT, Integer Sort (IS) and Embarrassingly Parallel (EP) from the NAS benchmarks; ILINK, a widely used genetic linkage analysis program; and Successive Over-Relaxation (SOR) and Traveling Salesman (TSP). Two different input data sets were used for five of the applications. We use two execution environments. The first is an 155 Mbps ATM network with eight Sparc-20 model 61 workstations; the second is an eight processor IBM SP/2. The differences in speedup between TreadMarks and PVM are dependent on the application, and, only to much a lesser extent, on the platform and the data set used. In particular, the TreadMarks speedup for six of the eight applications is within 15% of that achieved with PVM. For one application, the difference in speedup is between 15% and 30%, and for one application, the difference is around 50%. More important than the actual differences in speedups, we investigate the causes behind these differences. The cost of sending and receiving messages on current networks of workstations is very high, and previous work has identified communication costs as the primary source of overhead in software DSM implementations. The observed performance differences between PVM and TreadMarks are therefore primarily a result of differences in the amount of communication between the two systems. We identified four factors that contribute to the larger amount of communication in TreadMarks:1) extra messages due to the separation of synchronization and data transfer, 2) extra messages to handle access misses caused by the use of an invalidate protocol, 3) false sharing, and 4) d iff accumulation for migratory data. We have quantified the effect of the last three factors by measuring the performance gain when each is eliminated. Because the separation of synchronization and data transfer is a fundamental characteristic of the shared memory model, there is no way to measure its contribution to performance without completely deviating from the shared memory model. Of the three remaining factors, TreadMarks’ inability to send data belonging to different pages in a single message is the most important. The effect of false sharing is quite limited. Reducing diff accumulation benefits migratory data only when the diffs completely overlap. When these performance impediments are removed, all of the TreadMarks programs perform within 25% of PVM, and for six out of eight experiments, TreadMarks is less than 5% slower than PVM

    Scalable Parallel Computers for Real-Time Signal Processing

    Get PDF
    We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Gray T3D at Gray Eagan Center, and the Gray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introducedpublished_or_final_versio

    An Evaluation of Software Release-Consistent Protocols

    Get PDF
    This paper presents an evaluation of three software implementations of release consistency. Release consistent protocols allow data communication to be aggregated, and multiple writers to simultaneously modify a single page. We evaluated an eager invalidate protocol that enforces consistency when synchronization variables are released, a lazy invalidate protocol that enforces consistency when synchronization variables are acquired, and a lazy hybrid protocol that selectively uses update to reduce access misses. Our evaluation is based on implementations running on DECstation-5000/240s connected by an ATM LAN, and an execution driven simulator that allows us to vary network parameters. Our results show that the lazy protocols consistently outperform the eager protocol for all but one application, and that the lazy hybrid performs the best overall. However, the relative performance of the implementations is highly dependent on the relative speeds of the network, processor, and communication software. Lower bandwidths and high per byte software communication costs favor the lazy invalidate protocol, while high bandwidths and low per byte costs favor the hybrid. Performance of the eager protocol approaches that of the lazy protocols only when communication becomes essentially free
    • …
    corecore