
    Performance of MPI on the CRAY T3E-512

    The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory, the natural programming model for the machine is message passing. Since RUS has decided to primarily support the MPI standard, we have found it useful to test the performance of MPI on the machine for several standard message-passing constructs.
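    The most common of these standard constructs is the ping-pong test: one process sends a message, the partner echoes it back, and the round-trip time yields latency and bandwidth. The sketch below is a hypothetical illustration of that pattern using Python's multiprocessing pipes in place of MPI point-to-point calls; the message size and repeat count are illustrative, not taken from the paper.

```python
import time
from multiprocessing import Process, Pipe

def pong(conn, repeats):
    # Echo each incoming message straight back to the sender.
    for _ in range(repeats):
        conn.send_bytes(conn.recv_bytes())

def ping(msg_size=1 << 16, repeats=50):
    """Return (average round-trip time in seconds, total MB/s moved)."""
    parent, child = Pipe()
    worker = Process(target=pong, args=(child, repeats))
    worker.start()
    payload = bytes(msg_size)
    t0 = time.perf_counter()
    for _ in range(repeats):
        parent.send_bytes(payload)   # "MPI_Send"
        parent.recv_bytes()          # "MPI_Recv" of the echo
    elapsed = time.perf_counter() - t0
    worker.join()
    rtt = elapsed / repeats
    mb_per_s = 2 * msg_size * repeats / elapsed / 1e6  # bytes move both ways
    return rtt, mb_per_s
```

    Running `ping()` with a small message approximates latency; large messages approximate peak bandwidth, which is how such benchmarks typically separate the two quantities.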

    Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present Intel MPI Benchmark (IMB) results to study the performance of 11 MPI communication functions on these systems.

    Automatic Profiling of MPI Applications with Hardware Performance Counters

    This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize to a special syslog file. The user can get the same information in a separate file. Statistical summaries are computed weekly and monthly. The paper describes experiences with this library on the Cray T3E systems at HLRS Stuttgart and TU Dresden. It focuses on the problems of integrating the hardware performance counters into MPI counter profiling and presents first results with these counters. A second software design is also described that allows the integration of the profiling layer into a dynamic shared object MPI library without consuming the user's PMPI profiling interface.
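    The PMPI name-shifting idea behind such counter profiling (each `MPI_*` entry point is intercepted by a counting layer that forwards to the real `PMPI_*` routine) can be mimicked in plain Python with a decorator. This is a hypothetical sketch of the mechanism only; the function names and the statistics kept are illustrative, not the paper's actual instrumentation.

```python
import collections
import functools
import time

# Per-function counters, analogous to the per-MPI-call summary table.
_stats = collections.defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def profiled(fn):
    """Counting layer: like an MPI_* wrapper forwarding to PMPI_*."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        t0 = time.perf_counter()
        try:
            return fn(*args, **kwargs)   # the "PMPI" call underneath
        finally:
            s = _stats[fn.__name__]
            s["calls"] += 1
            s["seconds"] += time.perf_counter() - t0
    return wrapper

def finalize_report():
    """Analogue of dumping the gathered summary in MPI_Finalize."""
    return {name: dict(s) for name, s in _stats.items()}

@profiled
def send(data):
    # Stand-in for a communication routine being profiled.
    return len(data)
```

    Because the wrapper forwards unconditionally, the profiled program behaves exactly as before; only the side-channel counters are added, which mirrors why the paper's second design leaves the user's own PMPI layer free.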

    Balance of HPC Systems Based on HPCC Benchmark Results

    Abstract. Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, as well as for clusters with various network interconnects. Strengths and weaknesses of the communication interconnects are examined for three communication patterns. The HPCC suite was released to analyze the performance of high-performance computing architectures using several kernels that measure different memory and hardware access patterns, comprising latency-based measurements, memory streaming, inter-process communication, and floating-point computation. HPCC defines a set of benchmarks augmenting the High Performance Linpack used in the Top500 list. This paper describes the inter-process communication benchmarks of this suite. Based on the effective bandwidth benchmark, a special parallel random and natural ring communication benchmark has been developed for HPCC. Ping-Pong benchmarks on a set of process pairs can be used for further characterization of a system. This paper analyzes first results achieved with HPCC. The focus of this paper is on the balance between computational speed, memory bandwidth, and inter-node communication. Keywords. HPCC, network bandwidth, effective bandwidth, Linpack.
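    The distinction between the natural and random ring patterns comes down to how each rank's communication partner is chosen: neighbors in rank order (which a good mapping places on the same node or switch) versus a random permutation closed into one ring (which forces traffic across the whole fabric). A small sketch of the two partner layouts, with an illustrative seeded permutation that is not tied to HPCC's actual implementation:

```python
import random

def natural_ring(size):
    """Each rank sends to (rank + 1) % size: neighbors in rank order."""
    return [(r, (r + 1) % size) for r in range(size)]

def random_ring(size, seed=0):
    """A random permutation of ranks closed into a single ring, so most
    partners land on remote nodes (permutation choice is illustrative)."""
    ranks = list(range(size))
    random.Random(seed).shuffle(ranks)
    return [(ranks[i], ranks[(i + 1) % size]) for i in range(size)]
```

    Comparing measured bandwidth on the two rings is what exposes the gap between intra-node and cross-fabric communication that the balance analysis relies on.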

    Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

    Most HPC systems are clusters of shared-memory nodes. Parallel programming must combine distributed-memory parallelization on the node interconnect with shared-memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.
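    The two-level structure described here (MPI ranks across nodes, OpenMP threads within a node) can be miniaturized in Python, with processes standing in for MPI ranks and threads for OpenMP threads. This is purely a hypothetical illustration of the hierarchy, not the paper's benchmark code.

```python
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def node_work(rank, threads_per_node=4):
    # Shared-memory level: threads inside one "SMP node" produce partial
    # results, like an OpenMP parallel region.
    with ThreadPoolExecutor(max_workers=threads_per_node) as pool:
        partials = list(pool.map(lambda t: rank * threads_per_node + t,
                                 range(threads_per_node)))
    return sum(partials)

def hybrid_run(nodes=2, threads_per_node=4):
    # Distributed-memory level: one process per "node", combined at the
    # end like an MPI reduction.
    with Pool(nodes) as pool:
        return sum(pool.starmap(node_work,
                                [(r, threads_per_node) for r in range(nodes)]))
```

    The point of the hierarchy is exactly the separation the abstract asks about: the inner level optimizes shared-memory access, the outer level optimizes inter-node communication, and the two can in principle be tuned independently.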

    The controlled logical clock, a global clock for trace-based monitoring of parallel applications

    Event tracing and monitoring of the program flow and the message exchanges of parallel applications are difficult if each processor has its own unsynchronized clock. A survey of several strategies to generate a global time is given, and their limits are discussed. The controlled logical clock is presented: a new method, based on Lamport's logical clock, for modifying inexact timestamps in tracefiles. The new timestamps guarantee the clock condition, i.e. that the receive event of a message has a later timestamp than the corresponding send event. With the control algorithm, an approximation of the maximum of all local processor clocks is used as the global time. The corrected timestamps can also be used for performance measurements on pairs of events in different processes. A piecewise linear backward amortisation of the clock corrections guarantees a minimal error for measurements of time intervals between events in the same process. No additional protocol overhead is needed for the new method while tracing the application. The method can be implemented as a filter for tracefiles, or it can be integrated into monitoring and debugging tools for parallel applications.
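    The core of the clock condition can be shown in a few lines: walk the trace in order, and whenever a receive would precede its matching (corrected) send, push it forward by at least a minimum message latency while keeping each process's timestamps monotonic. This toy sketch implements only that forward correction, not the controlled-clock regulation or the backward amortisation; the event representation is a hypothetical one chosen for the example.

```python
def correct_timestamps(events, min_latency=1e-6):
    """Enforce the clock condition on a trace.

    `events` is a time-ordered list of dicts with keys 'proc', 'kind'
    ('send' or 'recv'), 'time', and for receives a 'send_index' pointing
    at the matching send event in the same list.
    """
    corrected = []
    last = {}  # last corrected timestamp emitted on each process
    for e in events:
        t = e["time"]
        if e["kind"] == "recv":
            # Clock condition: a receive must follow its (corrected) send.
            t = max(t, corrected[e["send_index"]]["time"] + min_latency)
        # Keep timestamps strictly increasing within each process.
        t = max(t, last.get(e["proc"], float("-inf")) + min_latency)
        fixed = dict(e)
        fixed["time"] = t
        corrected.append(fixed)
        last[e["proc"]] = t
    return corrected
```

    The full method additionally spreads each correction backwards over earlier events of the same process (the piecewise linear amortisation), so that local interval measurements are disturbed as little as possible.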

    Hybrid Parallel Programming: Performance Problems and Chances

    This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Benchmark results show that the hybrid masteronly programming model can be used more efficiently on some vector-type systems, although this model suffers from sleeping application threads while the master thread communicates. This paper analyzes strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects. Best performance can be achieved by overlapping communication and computation, but this scheme lacks ease of use.
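    The masteronly-versus-overlap trade-off can be illustrated with a toy timing experiment: in the masteronly style, computation waits while the "master" communicates; in the overlapped style, a background thread carries the communication concurrently. The sleeps below are hypothetical stand-ins for real transfer and compute phases, not measurements from the paper.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def communicate():
    time.sleep(0.2)  # stand-in for a blocking message transfer

def compute():
    time.sleep(0.2)  # stand-in for node-local number crunching

def masteronly():
    # Worker threads sit idle while the master communicates, then
    # everyone computes: phases run back to back.
    t0 = time.perf_counter()
    communicate()
    compute()
    return time.perf_counter() - t0

def overlapped():
    # Communication proceeds in a background thread while the
    # computation runs, hiding one phase behind the other.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(communicate)
        compute()
        future.result()
    return time.perf_counter() - t0
```

    The overlapped variant finishes in roughly the time of the longer phase rather than the sum of both, which is the performance chance the paper describes; the cost is restructuring the code so that the computation next to a transfer never depends on the data still in flight.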

    The controlled logical clock - a global time for trace based software monitoring of parallel applications in workstation clusters

    Copyright 1996 IEEE. Copies may not be used in any way that implies IEEE endorsement of a product or service of an employer.