Performance Analysis for Mesh and Mesh-Spectral Archetype Applications
This document outlines a simple method for benchmarking a parallel communication library and for using the results to model the performance of applications developed with that library. We use compositional performance analysis - decomposing a parallel program into its modular parts and analyzing their respective performances - to gain perspective on the performance of the whole program. This model is useful for predicting parallel program execution times for different program archetypes (e.g., mesh and mesh-spectral), using communication libraries built with different message-passing schemes (e.g., Fortran M and Fortran with MPI), running on different architectures (e.g., IBM SP2 and a network of Pentium personal computers).
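As a rough illustration of the compositional idea, the sketch below predicts the time of one mesh update step as a compute part plus a communication part. It is not the model from the paper: the linear latency/bandwidth message cost, the function names, and every parameter are assumptions chosen for illustration, with the two communication constants standing in for values a library benchmark would supply.

```python
def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed linear cost of one message: fixed latency plus transfer time."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def predicted_step_time(cells_per_proc, flop_per_cell, flop_rate,
                        halo_bytes, n_neighbors, latency_s,
                        bandwidth_bytes_per_s):
    """Compose a per-step estimate from separately measured parts.

    All inputs are hypothetical: flop_rate would come from a compute
    benchmark, latency_s and bandwidth_bytes_per_s from a ping-pong
    benchmark of the communication library.
    """
    compute = cells_per_proc * flop_per_cell / flop_rate
    # One boundary exchange with each neighbor, assumed not overlapped.
    comm = n_neighbors * message_time(halo_bytes, latency_s,
                                      bandwidth_bytes_per_s)
    return compute + comm


if __name__ == "__main__":
    t = predicted_step_time(cells_per_proc=1000, flop_per_cell=10,
                            flop_rate=1e6, halo_bytes=1000, n_neighbors=4,
                            latency_s=1e-4, bandwidth_bytes_per_s=1e6)
    print(f"predicted step time: {t:.4f} s")
```

Swapping in constants measured for a different library or machine changes the prediction without touching the compute term, which is the point of analyzing the parts separately.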
HPCmatlab: A Framework for Fast Prototyping of Parallel Applications in Matlab
The HPCmatlab framework has been developed for distributed memory programming in Matlab/Octave using the Message Passing Interface (MPI). The communication routines in the MPI library are implemented using MEX wrappers. Point-to-point, collective, and one-sided communication are supported. Benchmarking results show better performance than the Mathworks Distributed Computing Server. HPCmatlab has been used to successfully parallelize and speed up Matlab applications developed for scientific computing. The application results show good scalability while preserving ease of programming. HPCmatlab also enables shared memory programming using Pthreads and parallel I/O using the ADIOS package.
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreasing memory to core
ratio in modern high-performance platforms make efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems.
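One common way to load balance threads in a sparse matrix-vector product is to split rows so that each thread gets a similar number of nonzeros rather than a similar number of rows. The sketch below shows that idea for a CSR row-pointer array. It is an assumption about what such balancing could look like, not the scheme implemented in PETSc or in the paper, and `balance_rows_by_nnz` is a hypothetical helper name.

```python
from bisect import bisect_left


def balance_rows_by_nnz(row_ptr, n_threads):
    """Return n_threads + 1 row boundaries for a CSR matrix.

    Thread t would process rows bounds[t] .. bounds[t + 1] - 1. Boundaries
    are placed where the running nonzero count first reaches each thread's
    share of the total (illustrative heuristic only).
    """
    n_rows = len(row_ptr) - 1
    total_nnz = row_ptr[-1]
    bounds = [0]
    for t in range(1, n_threads):
        target = total_nnz * t / n_threads
        # First row whose starting offset is at or past the target.
        row = bisect_left(row_ptr, target)
        # Keep boundaries non-decreasing and inside the matrix.
        row = max(bounds[-1], min(row, n_rows))
        bounds.append(row)
    bounds.append(n_rows)
    return bounds


if __name__ == "__main__":
    # Five rows holding 8, 1, 1, 1, 1 nonzeros: the dense first row
    # is assigned to a thread of its own.
    print(balance_rows_by_nnz([0, 8, 9, 10, 11, 12], 2))
```

An equal split by row count would give the first thread rows 0 to 2 with 10 of the 12 nonzeros, while the split by nonzeros narrows that gap.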
Benchmarking the Intel® Xeon® Platinum 8160 Processor
This report presents a set of results for different microbenchmarks and applications on the Intel
Xeon Platinum 8160 Processor, formerly known as Skylake. For simplicity, we will use both Skylake
and SKX to refer to this processor. We use the Skylake nodes that will be available in Stampede2.
This system will provide Intel Knights Landing and Skylake chips interconnected by a 100 Gb/sec
Intel Omni-Path (OPA) network with a fat tree topology. The peak performance of the system will
be 18 PF. Texas Advanced Computing Center (TACC)
Study of Raspberry Pi 2 Quad-core Cortex A7 CPU Cluster as a Mini Supercomputer
High performance computing (HPC) devices are no longer exclusive to academic,
R&D, or military purposes. The use of HPC devices such as supercomputers is now
growing rapidly as new areas arise, such as big data and computer
simulation. This makes the use of supercomputers more inclusive. Today's
supercomputers have huge computing power, but require an enormous amount of
energy to operate. In contrast, a single board computer (SBC) such as the Raspberry
Pi has minimal computing power, but requires only a small amount of energy to
operate, and as a bonus it is small and cheap. This paper covers the results of
using many Raspberry Pi 2 SBCs, each with a quad-core Cortex A7 at 900 MHz, as a cluster
to compensate for their limited computing power. The High Performance Linpack (HPL) benchmark is used
to measure the computing power, and a power meter with a resolution of 10 mV / 10 mA
is used to measure the power consumption. The experiment shows that
increasing the number of cores used in each SBC in a cluster does not give a
significant increase in computing power. The experiment leads to the recommendation
that 4 nodes is the maximum number of nodes for an SBC cluster, based on the
characteristics of computing performance and power consumption. Comment: Pre-print of conference paper on International Conference on
Information Technology and Electrical Engineerin