Performance analysis of direct N-body algorithms for astrophysical simulations on distributed systems
We discuss the performance of direct summation codes used in the simulation
of astrophysical stellar systems on highly distributed architectures. These
codes compute the gravitational interaction among stars in an exact way and
have an O(N^2) scaling with the number of particles. They can be applied to a
variety of astrophysical problems, like the evolution of star clusters, the
dynamics of black holes, the formation of planetary systems, and cosmological
simulations. The simulation of realistic star clusters with sufficiently high
accuracy cannot be performed on a single workstation but may be possible on
parallel computers or grids. We have implemented two parallel schemes for a
direct N-body code and we study their performance on general purpose parallel
computers and large computational grids. We present the results of timing
analyses conducted on the different architectures and compare them with the
predictions from theoretical models. We conclude that the simulation of star
clusters with up to a million particles will be possible on large distributed
computers in the next decade. Simulating entire galaxies, however, will in
addition require new hybrid methods to speed up the calculation. Comment: 22 pages, 8 figures, accepted for publication in Parallel Computing.
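To make the O(N^2) direct-summation scheme described above concrete, here is a minimal serial sketch of the pairwise gravitational acceleration calculation; the function name, NumPy usage, and softening parameter are illustrative assumptions, not code from the parallel implementations benchmarked in the paper.

import numpy as np

def direct_accelerations(pos, mass, G=1.0, eps=1e-4):
    # pos  : (N, 3) float array of particle positions
    # mass : (N,) float array of particle masses
    # eps  : softening length to avoid singularities at zero separation
    # Every particle interacts with every other particle, so the cost
    # scales as O(N^2) with the number of particles N.
    n = len(mass)
    acc = np.zeros_like(pos, dtype=float)
    for i in range(n):
        dr = pos - pos[i]                        # vectors from particle i to all others
        r2 = np.sum(dr * dr, axis=1) + eps ** 2  # softened squared distances
        inv_r3 = r2 ** -1.5
        inv_r3[i] = 0.0                          # exclude self-interaction
        acc[i] = G * np.sum((mass * inv_r3)[:, None] * dr, axis=0)
    return acc

The parallel schemes studied in the paper distribute exactly this kind of pairwise loop over processors; the sketch only shows the underlying arithmetic.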
Analyzing and Modeling the Performance of the HemeLB Lattice-Boltzmann Simulation Environment
We investigate the performance of the HemeLB lattice-Boltzmann simulator for
cerebrovascular blood flow, aimed at providing timely and clinically relevant
assistance to neurosurgeons. HemeLB is optimised for sparse geometries,
supports interactive use, and scales well to 32,768 cores for problems with ~81
million lattice sites. We obtain a maximum performance of 29.5 billion site
updates per second, with only an 11% slowdown for highly sparse problems (5%
fluid fraction). We present steering and visualisation performance measurements
and provide a model which allows users to predict the performance, thereby
determining how to run simulations with maximum accuracy within time
constraints. Comment: Accepted by the Journal of Computational Science. 33 pages, 16 figures, 7 tables.
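As a rough illustration of how the quoted throughput translates into wall-clock time (a back-of-envelope estimate, not the performance model presented in the paper), one can divide the number of fluid lattice sites by the sustained site-update rate; the step count below is an assumption.

# Back-of-envelope estimate from the figures quoted in the abstract (illustrative only).
sites = 81e6    # ~81 million lattice sites
sups = 29.5e9   # 29.5 billion site updates per second (peak reported)
steps = 100_000 # assumed number of lattice-Boltzmann time steps

seconds_per_step = sites / sups
print(f"~{seconds_per_step * 1e3:.1f} ms per step, "
      f"~{seconds_per_step * steps / 60:.1f} minutes for {steps} steps")

At the reported peak this gives roughly 2.7 ms per step, which is the kind of estimate the paper's model lets users make when fitting simulations into clinical time constraints.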
A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
Sustaining a large fraction of single GPU performance in parallel
computations is considered to be the major problem of GPU-based clusters. In
this article, this topic is addressed in the context of a lattice Boltzmann
flow solver that is integrated in the WaLBerla software framework. We propose a
multi-GPU implementation using a block-structured MPI parallelization, suitable
for load balancing and heterogeneous computations on CPUs and GPUs. The
overhead required for multi-GPU simulations is discussed in detail and it is
demonstrated that the kernel performance can be sustained to a large extent.
With our GPU implementation, we achieve nearly perfect weak scalability on
InfiniBand clusters. In strong scaling scenarios, however, multi-GPU runs make
less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost
analysis must determine the best course of action for a particular simulation
task. Additionally, weak scaling results of heterogeneous simulations conducted
on CPUs and GPUs simultaneously are presented using clusters equipped with
varying node configurations. Comment: 20 pages, 12 figures.
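The weak versus strong scaling distinction drawn above can be made concrete with the standard efficiency definitions; the sketch below is generic, and the timings in it are placeholders rather than measurements from the paper.

def weak_scaling_efficiency(t1, tp):
    # Weak scaling: problem size grows with device count, so the ideal time stays constant.
    return t1 / tp

def strong_scaling_efficiency(t1, tp, p):
    # Strong scaling: fixed problem size, so the ideal time on p devices is t1 / p.
    return t1 / (p * tp)

# Placeholder timings (seconds per time step) for illustration only.
print(weak_scaling_efficiency(t1=0.50, tp=0.52))         # ~0.96: near-ideal weak scaling
print(strong_scaling_efficiency(t1=0.50, tp=0.09, p=8))  # ~0.69: efficiency lost in strong scaling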
Implementation and scaling of the fully coupled Terrestrial Systems Modeling Platform (TerrSysMP) in a massively parallel supercomputing environment – a case study on JUQUEEN (IBM Blue Gene/Q)
Continental-scale hyper-resolution simulations constitute a grand challenge in characterizing non-linear feedbacks of states and fluxes of the coupled water, energy, and biogeochemical cycles of terrestrial systems. Tackling this challenge requires advanced coupling and supercomputing technologies for earth system models, which are discussed in this study using the example of the implementation of the newly developed Terrestrial Systems Modeling Platform (TerrSysMP) on JUQUEEN (IBM Blue Gene/Q) of the Jülich Supercomputing Centre, Germany. The applied coupling strategies rely on the Multiple Program Multiple Data (MPMD) paradigm and require memory and load-balancing considerations in the exchange of coupling fields between the different component models and in the allocation of computational resources. These considerations can be addressed with advanced profiling and tracing tools, leading to efficient use of massively parallel computing environments, in which overall performance is then mainly determined by the parallel performance of the individual component models. However, model I/O and initialization at the petascale require major attention, because they constitute a true big-data challenge in view of future exascale capabilities that remains unsolved.
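In the MPMD paradigm mentioned above, disjoint groups of MPI ranks are assigned to the different component models, each with its own communicator. A minimal sketch using mpi4py follows; the component names and rank counts are hypothetical and do not reflect TerrSysMP's actual configuration or coupler.

from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# Hypothetical partitioning of ranks among three component models.
# In a true MPMD launch each component may be a separate executable;
# splitting the world communicator by "color" achieves the same grouping.
if rank < 64:
    color, component = 0, "atmosphere"
elif rank < 96:
    color, component = 1, "land_surface"
else:
    color, component = 2, "subsurface"

comp_comm = world.Split(color, rank)  # per-component communicator
print(f"world rank {rank} -> {component}, local rank {comp_comm.Get_rank()}")

Load balancing then amounts to choosing the rank counts per component so that the coupled models advance at comparable rates, which is where the profiling and tracing tools discussed in the study come in.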
10181 Abstracts Collection -- Program Development for Extreme-Scale Computing
From May 2nd to May 7th, 2010, the Dagstuhl Seminar 10181
"Program Development for Extreme-Scale Computing"
was held in Schloss Dagstuhl – Leibniz Center for Informatics.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper.
Links to extended abstracts or full papers are provided, if available.
Towards Loosely-Coupled Programming on Petascale Systems
We have extended the Falkon lightweight task execution framework to make
loosely coupled programming on petascale systems a practical and useful
programming model. This work studies and measures the performance factors
involved in applying this approach to enable the use of petascale systems by a
broader user community, and with greater ease. Our work enables the execution
of highly parallel computations composed of loosely coupled serial jobs with no
modifications to the respective applications. This approach allows a new, and
potentially far larger, class of applications to leverage petascale systems,
such as the IBM Blue Gene/P supercomputer. We present the challenges of I/O
performance encountered in making this model practical, and show results using
both microbenchmarks and real applications from two domains: economic energy
modeling and molecular dynamics. Our benchmarks show that we can scale up to
160K processor-cores with high efficiency, and can achieve sustained execution
rates of thousands of tasks per second. Comment: IEEE/ACM International Conference for High Performance Computing,
Networking, Storage and Analysis (SuperComputing/SC) 2008.
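The loosely coupled model described above amounts to dispatching large numbers of unmodified serial jobs across many cores. A minimal sketch of that pattern using only the Python standard library follows; it is not Falkon's API, and the binary name and task list are hypothetical.

import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_task(cmd):
    # Run one unmodified serial application as an independent job.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return cmd, result.returncode

# Hypothetical workload: many independent invocations of a serial binary.
tasks = [["./simulate", f"--input=case_{i}.dat"] for i in range(1000)]

with ThreadPoolExecutor(max_workers=64) as pool:
    for cmd, rc in pool.map(run_task, tasks):
        if rc != 0:
            print("task failed:", " ".join(cmd))

Frameworks such as the one described in the paper replace this single-node dispatcher with a lightweight, distributed task execution layer so that the same pattern scales to hundreds of thousands of cores, which is where the I/O challenges discussed above arise.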