124 research outputs found

    Performance of MPI on the CRAY T3E-512

    The CRAY T3E-512 is currently the most powerful machine available at RUS/hww. Although it provides support for shared memory, the natural programming model for the machine is message passing. Since RUS has decided to support primarily the MPI standard, we have found it useful to test the performance of MPI on the machine for several standard message-passing constructs.
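
    As a minimal sketch (not code from the report), a ping-pong test of the kind commonly used to measure point-to-point MPI latency and bandwidth could look as follows in C; the message size and repetition count are illustrative choices:

        /* Minimal MPI ping-pong sketch: ranks 0 and 1 bounce a buffer back and
         * forth; rank 0 reports the average round-trip time and bandwidth. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            const int reps = 1000;
            const int nbytes = 1 << 20;                    /* 1 MiB message */
            char *buf = malloc(nbytes);
            int rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                } else if (rank == 1) {
                    MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                    MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double t1 = MPI_Wtime();

            if (rank == 0)
                printf("avg round trip %.1f us, bandwidth %.1f MB/s\n",
                       1e6 * (t1 - t0) / reps,
                       2.0 * nbytes * reps / (t1 - t0) / 1e6);

            free(buf);
            MPI_Finalize();
            return 0;
        }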

    Systolic and Hyper-Systolic Algorithms for the Gravitational N-Body Problem, with an Application to Brownian Motion

    A systolic algorithm rhythmically computes and passes data through a network of processors. We investigate the performance of systolic algorithms for implementing the gravitational N-body problem on distributed-memory computers. Systolic algorithms minimize memory requirements by distributing the particles between processors. We show that the performance of systolic routines can be greatly enhanced by the use of non-blocking communication, which allows particle coordinates to be communicated at the same time that force calculations are being carried out. Hyper-systolic algorithms reduce the communication complexity at the expense of increased memory demands. As an example of an application requiring large N, we use the systolic algorithm to carry out direct-summation simulations, with 10^6 particles, of the Brownian motion of the supermassive black hole at the center of the Milky Way galaxy. We predict a 3D random velocity of 0.4 km/s for the black hole. Comment: 33 pages, 10 PostScript figures.
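
    The ring shift with communication/computation overlap described above can be sketched roughly as follows (an illustrative C fragment, not the authors' code; the particle count, initial positions, and softening length are placeholders):

        /* Systolic ring sketch: each rank owns a block of particles, and a copy of
         * that block circulates around the ring. MPI_Isend/MPI_Irecv start the next
         * shift while forces against the block currently held are accumulated, so
         * communication overlaps with the force computation. */
        #include <mpi.h>
        #include <math.h>
        #include <stdio.h>
        #include <string.h>

        #define NLOC 256                        /* particles per rank (illustrative) */

        static void accumulate_forces(double pos[][3], double other[][3],
                                      double acc[][3], int n)
        {
            const double eps2 = 1e-6;           /* Plummer softening (arbitrary) */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    double dx = other[j][0] - pos[i][0];
                    double dy = other[j][1] - pos[i][1];
                    double dz = other[j][2] - pos[i][2];
                    double r2 = dx*dx + dy*dy + dz*dz + eps2;
                    double inv = 1.0 / (r2 * sqrt(r2));
                    acc[i][0] += dx * inv;  acc[i][1] += dy * inv;  acc[i][2] += dz * inv;
                }
        }

        int main(int argc, char **argv)
        {
            int rank, size;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            static double pos[NLOC][3], acc[NLOC][3];
            static double travel[NLOC][3], incoming[NLOC][3];
            for (int i = 0; i < NLOC; i++)      /* arbitrary initial positions */
                for (int d = 0; d < 3; d++) pos[i][d] = rank + 0.001 * i + 0.1 * d;
            memcpy(travel, pos, sizeof pos);

            int left  = (rank - 1 + size) % size;
            int right = (rank + 1) % size;

            for (int step = 0; step < size; step++) {
                MPI_Request req[2];
                /* Start shifting the travelling block to the right neighbour ... */
                MPI_Isend(travel,   3 * NLOC, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
                MPI_Irecv(incoming, 3 * NLOC, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);
                /* ... while computing forces against the block we already hold. */
                accumulate_forces(pos, travel, acc, NLOC);
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
                memcpy(travel, incoming, sizeof travel);
            }

            if (rank == 0) printf("acc[0] = (%g, %g, %g)\n", acc[0][0], acc[0][1], acc[0][2]);
            MPI_Finalize();
            return 0;
        }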

    Performance Evaluation of Supercomputers using HPCC and IMB Benchmarks

    The HPC Challenge (HPCC) benchmark suite and the Intel MPI Benchmarks (IMB) are used to compare and evaluate the combined performance of processor, memory subsystem, and interconnect fabric of five leading supercomputers: SGI Altix BX2, Cray X1, Cray Opteron Cluster, Dell Xeon cluster, and NEC SX-8. These five systems use five different networks (SGI NUMALINK4, Cray network, Myrinet, InfiniBand, and NEC IXS). The complete set of HPCC benchmarks is run on each of these systems. Additionally, we present IMB results to study the performance of 11 MPI communication functions on these systems.
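
    For illustration only (not the IMB source), the kind of timing loop such benchmarks perform for a collective such as MPI_Allreduce can be sketched in C as follows; the repetition count and message sizes are arbitrary:

        /* Illustrative Allreduce timing loop in the spirit of a collective benchmark:
         * time MPI_Allreduce over a range of message sizes and report the slowest rank. */
        #include <mpi.h>
        #include <stdio.h>
        #include <stdlib.h>

        int main(int argc, char **argv)
        {
            const int reps = 100;
            int rank;
            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            for (int n = 1; n <= (1 << 20); n *= 4) {      /* doubles per message */
                double *in  = calloc(n, sizeof *in);
                double *out = calloc(n, sizeof *out);

                MPI_Barrier(MPI_COMM_WORLD);
                double t0 = MPI_Wtime();
                for (int i = 0; i < reps; i++)
                    MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
                double local = (MPI_Wtime() - t0) / reps, worst;

                /* Report the maximum time across ranks, as collective benchmarks usually do. */
                MPI_Reduce(&local, &worst, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
                if (rank == 0)
                    printf("%8zu bytes  %10.2f us\n", n * sizeof(double), worst * 1e6);

                free(in); free(out);
            }
            MPI_Finalize();
            return 0;
        }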

    Scalability of Incompressible Flow Computations on Multi-GPU Clusters Using Dual-Level and Tri-Level Parallelism

    High-performance computing using graphics processing units (GPUs) is gaining popularity in scientific computing, with many large compute clusters being augmented with multiple GPUs in each node. We investigate hybrid tri-level (MPI-OpenMP-CUDA) parallel implementations to explore the efficiency and scalability of incompressible flow computations on GPU clusters with up to 128 GPUs. This work details some of the unique issues faced when merging fine-grain parallelism on the GPU using CUDA with coarse-grain parallelism that uses OpenMP for intra-node and MPI for inter-node communication. Comparisons between the tri-level MPI-OpenMP-CUDA and dual-level MPI-CUDA implementations are shown for computationally large computational fluid dynamics (CFD) simulations. Our results demonstrate that the tri-level parallel implementation does not provide a significant performance advantage over the dual-level implementation; however, further research is needed to test this conclusion on clusters with a higher GPU-per-node density or with software that can exploit OpenMP's fine-grain parallelism more effectively.
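
    A structural sketch of such a tri-level decomposition is shown below (illustrative, not the authors' solver): MPI handles inter-node communication, one OpenMP thread per GPU provides the intra-node level, and a CUDA kernel does the fine-grain work on each device. launch_flow_kernel() is a hypothetical stand-in for a kernel launch that would live in a .cu file compiled with nvcc:

        /* Tri-level MPI-OpenMP-CUDA structure (sketch). */
        #include <mpi.h>
        #include <omp.h>
        #include <cuda_runtime.h>
        #include <stdio.h>

        /* Hypothetical placeholder: advances this GPU's subdomain by one time step. */
        void launch_flow_kernel(int device, int rank) {
            printf("rank %d advancing subdomain on GPU %d\n", rank, device);
        }

        int main(int argc, char **argv)
        {
            int provided, rank, ngpus = 0;

            /* MPI calls are made only outside the parallel region here,
             * so MPI_THREAD_FUNNELED support is sufficient. */
            MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            cudaGetDeviceCount(&ngpus);
            if (ngpus < 1) ngpus = 1;            /* fall back to one thread per node */

            for (int step = 0; step < 10; step++) {
                /* Coarse-grain intra-node parallelism: one OpenMP thread per GPU. */
                #pragma omp parallel num_threads(ngpus)
                {
                    int dev = omp_get_thread_num();
                    cudaSetDevice(dev);              /* bind this thread to a GPU */
                    launch_flow_kernel(dev, rank);   /* fine-grain work on the GPU */
                    cudaDeviceSynchronize();
                }
                /* Inter-node halo exchange would go here; a barrier stands in for it. */
                MPI_Barrier(MPI_COMM_WORLD);
            }

            MPI_Finalize();
            return 0;
        }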

    Proceedings of the 12th International Conference on Kinanthropology

    The Proceedings of the 12th Conference Sport and Quality of Life 2019 gather the submissions of the conference participants. Every submission was positively evaluated by reviewers from the corresponding field. The conference is divided into sections: Analysis of human movement; Sport training, nutrition and regeneration; Sport and social sciences; Active ageing and sarcopenia; Strength and conditioning training; and a section for PhD students.

    Automatic Profiling of MPI Applications with Hardware Performance Counters

    This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E and SGI Origin2000 systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written in MPI_Finalize to a special syslog file. The user can obtain the same information in a separate file. Statistical summaries are computed weekly and monthly. The paper describes experiences with this library on the Cray T3E systems at HLRS Stuttgart and TU Dresden. It focuses on the problems of integrating the hardware performance counters into MPI counter profiling and presents first results with these counters. A second software design is also described that allows the profiling layer to be integrated into a dynamic shared object MPI library without consuming the user's PMPI profiling interface.
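
    The name-shift profiling mechanism such a layer builds on can be sketched as follows (an illustrative C fragment, not the HLRS library): the wrapper intercepts MPI_Send, counts calls, bytes, and time, and its MPI_Finalize writes the per-rank summary before forwarding to PMPI_Finalize. It would be compiled into a wrapper library linked ahead of the MPI library, and hardware-counter reads would be added at the same interception points:

        /* PMPI interception sketch: count MPI_Send traffic, report it in MPI_Finalize. */
        #include <mpi.h>
        #include <stdio.h>

        static long   send_calls = 0;
        static long   send_bytes = 0;
        static double send_time  = 0.0;

        int MPI_Send(const void *buf, int count, MPI_Datatype type,
                     int dest, int tag, MPI_Comm comm)
        {
            int size;
            double t0 = MPI_Wtime();
            int err = PMPI_Send(buf, count, type, dest, tag, comm);   /* real send */
            send_time += MPI_Wtime() - t0;

            MPI_Type_size(type, &size);
            send_calls++;
            send_bytes += (long)count * size;
            return err;
        }

        int MPI_Finalize(void)
        {
            int rank;
            PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
            /* In the library described above this record goes to a syslog-style
             * file; writing to stderr keeps the sketch self-contained. */
            fprintf(stderr, "rank %d: %ld MPI_Send calls, %ld bytes, %.3f s\n",
                    rank, send_calls, send_bytes, send_time);
            return PMPI_Finalize();
        }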