Performance of MPI on the CRAY T3E-512
The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory, the natural programming model for the machine is message passing. Since RUS has decided to primarily support the MPI standard, we have found it useful to test the performance of MPI on the machine for several standard message-passing constructs.
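The point-to-point measurements in such tests typically follow a ping-pong pattern: one process sends a message, the partner echoes it back, and half the mean round-trip time estimates the one-way latency. A minimal illustrative sketch of that timing loop, using Python's multiprocessing pipes instead of MPI (all names and parameters are assumptions for illustration, not the paper's code):

```python
import time
from multiprocessing import Process, Pipe

def echo(conn, rounds):
    # Passive partner: return every received message unchanged.
    for _ in range(rounds):
        conn.send(conn.recv())

def pingpong_latency(rounds=1000, payload=b"x" * 8):
    # Active side: time `rounds` round trips and report half the
    # mean round-trip time, as point-to-point latency benchmarks do.
    parent, child = Pipe()
    p = Process(target=echo, args=(child, rounds))
    p.start()
    t0 = time.perf_counter()
    for _ in range(rounds):
        parent.send(payload)
        parent.recv()
    t1 = time.perf_counter()
    p.join()
    return (t1 - t0) / rounds / 2  # one-way latency estimate in seconds
```

Repeating the loop with growing payload sizes yields the familiar latency/bandwidth curve; in MPI the pipe operations would correspond to `MPI_Send`/`MPI_Recv` between two ranks.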
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmarks (IMB) are used to compare and evaluate the combined performance of the processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present IMB results to study the performance of 11 MPI communication functions on these systems.
Automatic Profiling of MPI Applications with Hardware Performance Counters
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize to a special syslog file. The user can obtain the same information in a separate file. Statistical summaries are computed weekly and monthly. The paper describes experiences with this library on the Cray T3E systems at HLRS Stuttgart and TU Dresden. It focuses on the problems of integrating the hardware performance counters into MPI counter profiling and presents first results with these counters. A second software design is also described that allows the integration of the profiling layer into a dynamic shared object MPI library without consuming the user's PMPI profiling interface.
Balance of HPC Systems Based on HPCC Benchmark Results
Abstract. Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, as well as for clusters with various network interconnects. Strengths and weaknesses of the communication interconnects are examined for three communication patterns. The HPCC suite was released to analyze the performance of high-performance computing architectures using several kernels that measure different memory and hardware access patterns, comprising latency-based measurements, memory streaming, inter-process communication, and floating-point computation. HPCC defines a set of benchmarks augmenting the High Performance Linpack used in the Top500 list. This paper describes the inter-process communication benchmarks of the suite. Based on the effective bandwidth benchmark, a special parallel random and natural ring communication benchmark has been developed for HPCC. Ping-pong benchmarks on a set of process pairs can be used for further characterization of a system. This paper analyzes first results achieved with HPCC, focusing on the balance between computational speed, memory bandwidth, and inter-node communication.
Keywords: HPCC, network bandwidth, effective bandwidth, Linpack
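The difference between the natural and random ring patterns mentioned above comes down to how each process's communication partners are chosen. A minimal sketch of the two neighbor assignments (illustrative only, not the HPCC implementation):

```python
import random

def natural_ring(nprocs):
    # Process i exchanges with its rank neighbors (i-1, i+1) mod nprocs,
    # so on a cluster of SMP nodes many partners share a node.
    return [((i - 1) % nprocs, (i + 1) % nprocs) for i in range(nprocs)]

def random_ring(nprocs, seed=0):
    # A random permutation of the ranks defines the ring order, so
    # neighbors usually land on different nodes, stressing the
    # inter-node network rather than intra-node shared memory.
    order = list(range(nprocs))
    random.Random(seed).shuffle(order)
    pos = {rank: idx for idx, rank in enumerate(order)}
    return [(order[(pos[i] - 1) % nprocs], order[(pos[i] + 1) % nprocs])
            for i in range(nprocs)]
```

In the benchmark itself, every process sends to and receives from both neighbors simultaneously, and the aggregate bytes moved per second give the reported ring bandwidth.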
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Most HPC systems are clusters of shared-memory nodes. Parallel programming must combine the distributed-memory parallelization on the node interconnect with the shared-memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.
The controlled logical clock, a global clock for trace-based monitoring of parallel applications
Event tracing and monitoring of the program flow and the message exchanges of parallel applications are difficult if each processor has its own unsynchronized clock. A survey of several strategies to generate a global time is given, and their limits are discussed. The controlled logical clock is presented: a new method based on Lamport's logical clock that modifies inexact timestamps of tracefiles. The new timestamps guarantee the clock condition, i.e. that the receive event of a message has a later timestamp than the corresponding send event. With the control algorithm, an approximation of the maximum of all local processor clocks is used as the global time. The corrected timestamps can also be used for performance measurements with pairs of events in different processes. A piecewise linear backward amortization of the clock corrections guarantees a minimal error for measurements of time intervals between events in the same process. No additional protocol overhead is needed for the new method while tracing the application. The method can be implemented as a filter for tracefiles, or it can be integrated into monitoring and debugging tools for parallel applications.
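The clock condition at the heart of this method can be sketched as a single forward pass over a merged trace. The following illustrative Python filter (not the paper's implementation; the event format and the `delta` parameter are assumptions, and the backward amortization step is omitted) shifts a process's clock forward whenever a receive would otherwise precede its matching send:

```python
def enforce_clock_condition(events, delta=1e-6):
    # events: list of dicts, each with 'proc' (process id), 'time'
    # (local timestamp), 'kind' ('send' | 'recv' | 'internal'), and
    # 'msg' (message id, for send/recv). Assumes the merged stream
    # lists each send before its matching receive.
    # Returns new events whose timestamps satisfy the clock condition:
    # every receive is at least `delta` later than the matching send.
    send_time = {}   # message id -> corrected send timestamp
    shift = {}       # per-process forward correction currently in effect
    out = []
    for ev in events:
        t = ev["time"] + shift.get(ev["proc"], 0.0)
        if ev["kind"] == "send":
            send_time[ev["msg"]] = t
        elif ev["kind"] == "recv":
            needed = send_time[ev["msg"]] + delta
            if t < needed:
                # Clock condition violated: advance this process's clock,
                # which also shifts all of its later events forward.
                shift[ev["proc"]] = shift.get(ev["proc"], 0.0) + (needed - t)
                t = needed
        out.append({**ev, "time": t})
    return out
```

The controlled logical clock goes further than this sketch: its controller tracks an approximation of the maximum of all local clocks, and the piecewise linear backward amortization distributes each correction over preceding events so that intra-process interval measurements stay accurate.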
Hybrid Parallel Programming: Performance Problems and Chances
This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Benchmark results show that the hybrid masteronly programming model can be used more efficiently on some vector-type systems, although this model suffers from sleeping application threads while the master thread communicates. The paper analyzes strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects. Best performance can be achieved by overlapping communication and computation, but this scheme lacks ease of use.
The controlled logical clock - a global time for trace based software monitoring of parallel applications in workstation clusters
Copyright 1996 IEEE. Copies may not be used in any way that implies IEEE endorsement of a product or service of an employer.
- …