Performance of MPI on the CRAY T3E-512
The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory, the natural programming model for the machine is message passing. Since RUS has decided to primarily support the MPI standard, we have found it useful to test the performance of MPI on the machine for several standard message-passing constructs.
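The point-to-point measurements in such tests typically follow a ping-pong pattern: one process sends a message, the partner echoes it back, and half the mean round-trip time estimates the one-way latency. A minimal illustrative sketch of that timing loop, using Python's multiprocessing pipes instead of MPI (all names and parameters are assumptions for illustration, not the paper's code):

```python
import time
from multiprocessing import Process, Pipe

def echo(conn, rounds):
    # Passive partner: return every received message unchanged.
    for _ in range(rounds):
        conn.send(conn.recv())

def pingpong_latency(rounds=1000, payload=b"x" * 8):
    # Active side: time `rounds` round trips and report half the
    # mean round-trip time, as point-to-point latency benchmarks do.
    parent, child = Pipe()
    p = Process(target=echo, args=(child, rounds))
    p.start()
    t0 = time.perf_counter()
    for _ in range(rounds):
        parent.send(payload)
        parent.recv()
    t1 = time.perf_counter()
    p.join()
    return (t1 - t0) / rounds / 2  # one-way latency estimate in seconds
```

Repeating the loop with growing payload sizes yields the familiar latency/bandwidth curve; in MPI the pipe operations would correspond to `MPI_Send`/`MPI_Recv` between two ranks.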
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmarks (IMB) are used to compare and evaluate the combined performance of the processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present IMB results to study the performance of 11 MPI communication functions on these systems.
Automatic Profiling of MPI Applications with Hardware Performance Counters
This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize to a special syslog file. The user can obtain the same information in a separate file. Statistical summaries are computed weekly and monthly. The paper describes experiences with this library on the Cray T3E systems at HLRS Stuttgart and TU Dresden. It focuses on the problems of integrating the hardware performance counters into MPI counter profiling and presents first results with these counters. A second software design is also described that allows the integration of the profiling layer into a dynamic shared object MPI library without consuming the user's PMPI profiling interface.
Balance of HPC Systems Based on HPCC Benchmark Results
Abstract. Based on results reported by the HPC Challenge benchmark suite (HPCC), the balance between computational speed, communication bandwidth, and memory bandwidth is analyzed for HPC systems from Cray, NEC, IBM, and other vendors, as well as for clusters with various network interconnects. Strengths and weaknesses of the communication interconnects are examined for three communication patterns. The HPCC suite was released to analyze the performance of high-performance computing architectures using several kernels that measure different memory and hardware access patterns, comprising latency-based measurements, memory streaming, inter-process communication, and floating-point computation. HPCC defines a set of benchmarks augmenting the High Performance Linpack used in the Top500 list. This paper describes the inter-process communication benchmarks of the suite. Based on the effective bandwidth benchmark, a special parallel random and natural ring communication benchmark has been developed for HPCC. Ping-pong benchmarks on a set of process pairs can be used for further characterization of a system. This paper analyzes first results achieved with HPCC, focusing on the balance between computational speed, memory bandwidth, and inter-node communication.
Keywords: HPCC, network bandwidth, effective bandwidth, Linpack
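The difference between the natural and random ring patterns mentioned above comes down to how each process's communication partners are chosen. A minimal sketch of the two neighbor assignments (illustrative only, not the HPCC implementation):

```python
import random

def natural_ring(nprocs):
    # Process i exchanges with its rank neighbors (i-1, i+1) mod nprocs,
    # so on a cluster of SMP nodes many partners share a node.
    return [((i - 1) % nprocs, (i + 1) % nprocs) for i in range(nprocs)]

def random_ring(nprocs, seed=0):
    # A random permutation of the ranks defines the ring order, so
    # neighbors usually land on different nodes, stressing the
    # inter-node network rather than intra-node shared memory.
    order = list(range(nprocs))
    random.Random(seed).shuffle(order)
    pos = {rank: idx for idx, rank in enumerate(order)}
    return [(order[(pos[i] - 1) % nprocs], order[(pos[i] + 1) % nprocs])
            for i in range(nprocs)]
```

In the benchmark itself, every process sends to and receives from both neighbors simultaneously, and the aggregate bytes moved per second give the reported ring bandwidth.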
Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures
Most HPC systems are clusters of shared-memory nodes. Parallel programming must combine the distributed-memory parallelization on the node interconnect with the shared-memory parallelization inside each node. The hybrid MPI+OpenMP programming model is compared with pure MPI, compiler-based parallelization, and other parallel programming models on hybrid architectures. The paper focuses on bandwidth and latency aspects, and also on whether programming paradigms can separate the optimization of communication and computation. Benchmark results are presented for hybrid and pure MPI communication. This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes.
The controlled logical clock, a global clock for trace-based monitoring of parallel applications
Event tracing and monitoring of the program flow and the message exchanges of parallel applications are difficult if each processor has its own unsynchronized clock. A survey of several strategies to generate a global time is given, and their limits are discussed. The controlled logical clock is presented: a new method based on Lamport's logical clock that modifies inexact timestamps of tracefiles. The new timestamps guarantee the clock condition, i.e. that the receive event of a message has a later timestamp than the corresponding send event. With the control algorithm, an approximation of the maximum of all local processor clocks is used as the global time. The corrected timestamps can also be used for performance measurements with pairs of events in different processes. A piecewise linear backward amortization of the clock corrections guarantees a minimal error for measurements of time intervals between events in the same process. No additional protocol overhead is needed for the new method while tracing the application. The method can be implemented as a filter for tracefiles, or it can be integrated into monitoring and debugging tools for parallel applications.
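The clock condition at the heart of this method can be sketched as a single forward pass over a merged trace. The following illustrative Python filter (not the paper's implementation; the event format and the `delta` parameter are assumptions, and the backward amortization step is omitted) shifts a process's clock forward whenever a receive would otherwise precede its matching send:

```python
def enforce_clock_condition(events, delta=1e-6):
    # events: list of dicts, each with 'proc' (process id), 'time'
    # (local timestamp), 'kind' ('send' | 'recv' | 'internal'), and
    # 'msg' (message id, for send/recv). Assumes the merged stream
    # lists each send before its matching receive.
    # Returns new events whose timestamps satisfy the clock condition:
    # every receive is at least `delta` later than the matching send.
    send_time = {}   # message id -> corrected send timestamp
    shift = {}       # per-process forward correction currently in effect
    out = []
    for ev in events:
        t = ev["time"] + shift.get(ev["proc"], 0.0)
        if ev["kind"] == "send":
            send_time[ev["msg"]] = t
        elif ev["kind"] == "recv":
            needed = send_time[ev["msg"]] + delta
            if t < needed:
                # Clock condition violated: advance this process's clock,
                # which also shifts all of its later events forward.
                shift[ev["proc"]] = shift.get(ev["proc"], 0.0) + (needed - t)
                t = needed
        out.append({**ev, "time": t})
    return out
```

The controlled logical clock goes further than this sketch: its controller tracks an approximation of the maximum of all local clocks, and the piecewise linear backward amortization distributes each correction over preceding events so that intra-process interval measurements stay accurate.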
Hybrid Parallel Programming: Performance Problems and Chances
This paper analyzes the strengths and weaknesses of several parallel programming models on clusters of SMP nodes. Benchmark results show that the hybrid masteronly programming model can be used more efficiently on some vector-type systems, although this model suffers from sleeping application threads while the master thread communicates. The paper analyzes strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker interconnects. Best performance can be achieved by overlapping communication and computation, but this scheme lacks ease of use.
The controlled logical clock - a global time for trace based software monitoring of parallel applications in workstation clusters
Copyright 1996 IEEE. Copies may not be used in any way that implies IEEE endorsement of a product or service of an employer.
- …