44 research outputs found
Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks
The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmark (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem and interconnect fabric of five leading supercomputers - SGI Altix BX2, Cray XI, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks are run on each of these systems. Additionally, we present Intel MPI Benchmarks (IMB) results to study the performance of 11 MPI communication functions on these systems
Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism
High performance computing using graphics processing units (GPUs) is gaining popularity in the scientific computing field, with many large compute clusters being augmented with multiple GPUs in each node. We investigate hybrid tri-level (MPI-OpenMP-CUDA) parallel implementations to explore the efficiency and scalability of incompressible flow computations on GPU clusters up to 128 GPUS. This work details some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism using OpenMP for intra-node and MPI for inter-node communication. Comparisons between the tri-level MPI-OpenMP-CUDA and dual-level MPI-CUDA implementations are shown using computationally large computational fluid dynamics (CFD) simulations. Our results demonstrate that a tri-level parallel implementation does not provide a significant advantage in performance over the dual-level implementation, however further research is needed to justify our conclusion for a cluster with a high GPU per node density or when using software that can utilize OpenMP’s fine-grain parallelism more effectively
Die geregelte logische Uhr, eine globale Uhr fuer tracebasierte Ueberwachung paralleler Anwendungen
Available from TIB Hannover: RO 7298(44) / FIZ - Fachinformationszzentrum Karlsruhe / TIB - Technische InformationsbibliothekSIGLEDEGerman
GPU-Aware Intranode MPI_Allreduce
Modern multi-core clusters are increasingly using GPUs to achieve higher performance and power efficiency. In such clusters, efficient communication among processes with data residing in GPU memory is of paramount importance to the performance of MPI applications. This paper investigates the efficient design of intranode MPI Allreduce operation in GPU clusters. We propose two design alternatives that ex-ploit in-GPU reduction and fast intranode communication capabilities of modern GPUs. Our GPU shared-buffer aware design and GPU-aware Binomial reduce-broadcast algorith-mic approach provide significant speedup over MVAPICH2 by up to 22 and 16 times, respectively
Effective File-I/O Bandwidth Benchmark
The effective I/O bandwidth benchmark (b{_}eff{_}io) covers two goals: (1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and (2) to get detailed information about several access patterns and buffer lengths. The benchmark examines ''first write'', ''rewrite'' and ''read'' access, strided (individual and shared pointers) and segmented collective patterns on one file per application and non-collective access to one file per process. The number of parallel accessing processes is also varied and well-formed I/O is compared with non-well formed. On systems, meeting the rule that the total memory can be written to disk in 10 minutes, the benchmark should not need more than 15 minutes for a first pass of all patterns. The benchmark is designed analogously to the effective bandwidth benchmark for message passing (b{_}eff) that characterizes the message passing capabilities of a system in a few minutes. First results of the b{_}eff{_}io benchmark are given for IBM SP and Cray T3E systems and compared with existing benchmarks based on parallel Posix-I/O