High Performance P3M N-body code: CUBEP3M
This paper presents CUBEP3M, a publicly available high-performance
cosmological N-body code, and describes many utilities and extensions that have
been added to the standard package. These include a memory-light runtime SO
halo finder, a non-Gaussian initial conditions generator, and a system of
unique particle identification. CUBEP3M is fast, its accuracy is tuneable to
optimize speed or memory, and it has been run on more than 27,000 cores, achieving
within a factor of two of ideal weak scaling even at this problem size. The
code can be run in an extra-lean mode in which the peak memory footprint for large
runs is as low as 37 bytes per particle, almost two times leaner than
other widely used N-body codes. However, load imbalances can increase this
requirement by a factor of two, such that fast configurations with all the
utilities enabled and load imbalances factored in require between 70 and 120
bytes per particle. CUBEP3M is well suited to studying large-scale
cosmological systems, where imbalances are not too large and adaptive
time-stepping is not essential. It has already been used for a broad range of
science applications that require either large samples of non-linear
realizations or very large dark matter N-body simulations, including
cosmological reionization, halo formation, baryonic acoustic oscillations, weak
lensing, and non-Gaussian statistics. We discuss the structure, the accuracy,
the known systematic effects, and the scaling performance of the code and its
utilities, where applicable.
Comment: 20 pages, 17 figures, added halo profiles, updated to match MNRAS accepted version
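As a back-of-the-envelope illustration of what those per-particle figures imply at scale, the sketch below converts bytes per particle into a total memory budget; the 4096^3 particle count is an assumed example, not a run reported in the paper.

```python
# Rough memory budget at the per-particle figures quoted in the abstract.
# The particle count below is an illustrative assumption.
n_particles = 4096**3  # e.g. a 4096^3-particle simulation

for label, bytes_pp in [("extra-lean mode", 37),
                        ("all utilities + load imbalance", 120)]:
    total_tib = n_particles * bytes_pp / 2**40
    print(f"{label}: {bytes_pp} B/particle -> {total_tib:.1f} TiB total")
```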
A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
Sustaining a large fraction of single-GPU performance in parallel
computations is considered to be the major problem of GPU-based clusters. In
this article, this topic is addressed in the context of a lattice Boltzmann
flow solver that is integrated in the WaLBerla software framework. We propose a
multi-GPU implementation using a block-structured MPI parallelization, suitable
for load balancing and heterogeneous computations on CPUs and GPUs. The
overhead required for multi-GPU simulations is discussed in detail, and it is
demonstrated that the kernel performance can be sustained to a large extent.
With our GPU implementation, we achieve nearly perfect weak scalability on
InfiniBand clusters. However, in strong-scaling scenarios multi-GPU setups make less
efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost
analysis must determine the best course of action for a particular simulation
task. Additionally, weak-scaling results of heterogeneous simulations conducted
on CPUs and GPUs simultaneously are presented, using clusters equipped with
varying node configurations.
Comment: 20 pages, 12 figures
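The sketch below illustrates the ghost-layer exchange whose cost such a multi-GPU parallelization must amortize; it assumes a simple 1D slab decomposition in Python/mpi4py, standing in for WaLBerla's actual block and CUDA kernel infrastructure.

```python
# Minimal sketch of the halo (ghost-layer) exchange behind a block-structured
# LBM parallelization. A 1D slab decomposition with numpy arrays stands in for
# the actual WaLBerla blocks and GPU kernels (illustrative assumption only).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nx_local, ny = 64, 256                    # cells owned by this rank
f = np.zeros((nx_local + 2, ny))          # +2 ghost layers in x
f[1:-1, :] = rank                         # dummy "distribution" data

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange ghost layers with neighbours; this communication is the
# multi-GPU overhead whose cost the paper analyses.
comm.Sendrecv(f[1, :], dest=left, recvbuf=f[-1, :], source=right)
comm.Sendrecv(f[-2, :], dest=right, recvbuf=f[0, :], source=left)
```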
Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters
The classical method of determining the atomic structure of complex molecules
by analyzing diffraction patterns is currently undergoing drastic developments.
Modern techniques for producing extremely bright and coherent X-ray lasers
allow a beam of streaming particles to be intercepted and hit by an ultrashort,
high-energy X-ray beam. Through machine-learning methods, the data thus
collected can be transformed into a three-dimensional volumetric intensity map
of the particle itself. The computational complexity associated with this
problem is so high that clusters of data-parallel accelerators are required.
We have implemented a distributed and highly efficient algorithm for
inversion of large collections of diffraction patterns, targeting clusters of
hundreds of GPUs. Given the enormous amount of diffraction data expected in the
foreseeable future, this is the scale required to approach real-time
processing of data at the beam site. Using both real and synthetic data, we
examine the scaling properties of the application and discuss the overall
computational viability of this exciting and novel imaging technique.
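A minimal sketch of the data-parallel structure such an inversion typically has is given below: each rank (one GPU host in practice) processes its share of patterns, and the partial 3D volumes are reduced globally. The per-pattern step is a placeholder, and this is an assumed structure for illustration, not the authors' actual algorithm.

```python
# Illustrative data-parallel skeleton for diffraction-pattern inversion:
# distribute patterns across ranks, accumulate a shared 3D intensity volume.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n_patterns, det, vol = 10_000, 128, 64               # illustrative sizes
my_patterns = np.random.rand(n_patterns // size, det, det)

local_volume = np.zeros((vol, vol, vol))
for pattern in my_patterns:
    # Placeholder for the expensive per-pattern step (orientation matching,
    # tomographic insertion, ...), which would run on the GPU.
    local_volume[0, 0, 0] += pattern.sum()

# Combine partial volumes from all ranks into the final intensity map.
volume = np.empty_like(local_volume)
comm.Allreduce(local_volume, volume, op=MPI.SUM)
```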
Initial Conditions for Large Cosmological Simulations
This technical paper describes a software package that was designed to
produce initial conditions for large cosmological simulations in the context of
the Horizon collaboration. These tools generalize E. Bertschinger's Grafic1
software to distributed parallel architectures and offer a flexible alternative
to the Grafic2 software for "zoom" initial conditions, at the price of a large
cumulative CPU and memory usage. The codes have been validated up to resolutions
of 4096^3 and were used to generate the initial conditions of large
hydrodynamical and dark matter simulations. They also provide the means to generate
constrained realisations for the purpose of producing initial conditions
compatible with, e.g., the Local Group or the SDSS catalog.
Comment: 12 pages, 11 figures, submitted to ApJ
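For orientation, the sketch below shows the standard single-node recipe for a Gaussian random-field realization of the kind Grafic-style tools produce: white noise filtered by sqrt(P(k)) in Fourier space. The power-law spectrum is an illustrative stand-in for a real transfer function, and the distributed and "zoom" machinery of the actual package is omitted.

```python
# Minimal single-node sketch of Gaussian random-field initial conditions:
# draw white noise on a grid, scale its Fourier modes by sqrt(P(k)), and
# transform back. The power-law P(k) below is an illustrative assumption.
import numpy as np

n = 128                                            # grid cells per side (4096 in the paper)
kfreq = np.fft.fftfreq(n) * n                      # integer wavenumbers
kx, ky, kz = np.meshgrid(kfreq, kfreq, kfreq, indexing="ij")
k = np.sqrt(kx**2 + ky**2 + kz**2)
k[0, 0, 0] = 1.0                                   # avoid division by zero at k = 0

power = k**-2.0                                    # illustrative power-law spectrum
power[0, 0, 0] = 0.0                               # no mean-density perturbation

noise = np.fft.fftn(np.random.normal(size=(n, n, n)))
delta = np.fft.ifftn(noise * np.sqrt(power)).real  # density contrast field
print("sigma of delta:", delta.std())
```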
User-Friendly Parallel Computations with Econometric Examples
This paper shows how a high-level matrix programming language may be used to perform Monte Carlo simulation, bootstrapping, estimation by maximum likelihood and GMM, and kernel regression in parallel on symmetric multiprocessor computers or clusters of workstations. Parallelization is implemented in such a way that an investigator may use the programs without any knowledge of parallel programming. A bootable CD that allows rapid creation of a cluster for parallel computing is introduced. Examples show that parallelization can lead to substantial reductions in computational time. A detailed discussion of how the Monte Carlo problem was parallelized is included as an example for learning to write parallel programs for Octave.
Keywords: parallel computing, Monte Carlo, bootstrapping, maximum likelihood, GMM, kernel regression
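A minimal sketch of the embarrassingly parallel Monte Carlo pattern described above is given below, written in Python with multiprocessing rather than the paper's Octave tools; the OLS-slope replication is an assumed example of the per-replication work.

```python
# Sketch of the parallel Monte Carlo pattern: split replications across
# workers, gather the results. Python/multiprocessing stands in for the
# paper's Octave/MPI setup; the replication itself is an assumed example.
import numpy as np
from multiprocessing import Pool

def one_replication(seed):
    """One Monte Carlo replication: OLS slope estimate on a simulated sample."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=100)
    y = 1.0 + 0.5 * x + rng.normal(size=100)   # true slope is 0.5
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

if __name__ == "__main__":
    with Pool() as pool:                        # one worker per CPU core
        estimates = pool.map(one_replication, range(10_000))
    print("mean slope estimate:", np.mean(estimates))
```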
Solution of the Skyrme-Hartree-Fock-Bogolyubov equations in the Cartesian deformed harmonic-oscillator basis. (VII) HFODD (v2.49t): a new version of the program
We describe the new version (v2.49t) of the code HFODD, which solves the
nuclear Skyrme Hartree-Fock (HF) or Skyrme Hartree-Fock-Bogolyubov (HFB)
problem using the Cartesian deformed harmonic-oscillator basis. In the new
version, we have implemented the following physics features: (i) isospin
mixing and projection, (ii) the finite-temperature formalism for the HFB and
HF+BCS methods, (iii) the Lipkin translational energy correction method, (iv)
the calculation of the shell correction. A number of specific numerical methods
have also been implemented in order to deal with large-scale multi-constraint
calculations and hardware limitations: (i) the two-basis method for the HFB
method, (ii) the Augmented Lagrangian Method (ALM) for multi-constraint
calculations, (iii) the linear constraint method based on the approximation of
the RPA matrix for multi-constraint calculations, (iv) an interface with the
axial and parity-conserving Skyrme-HFB code HFBTHO, (v) the mixing of the HF or
HFB matrix elements instead of the HF fields. Special care has been taken to
enable use of the code on massively parallel leadership-class computers. For this
purpose, the following features are now available in this version: (i) the
Message Passing Interface (MPI) framework, (ii) scalable input data routines,
(iii) multi-threading via OpenMP pragmas, (iv) parallel diagonalization of the
HFB matrix in the simplex-breaking case using the ScaLAPACK library. Finally,
several minor errors in the previously published version were corrected.
Comment: Accepted for publication in Computer Physics Communications. Program
files re-submitted to the Comp. Phys. Comm. Program Library after correction of
several minor bugs
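As a generic illustration of the Augmented Lagrangian Method mentioned among the numerical features, the toy sketch below minimizes a stand-in quadratic "energy" under one linear constraint; it shows only the ALM idea (multiplier plus quadratic penalty) and is not HFODD's Fortran implementation.

```python
# Toy Augmented Lagrangian Method: minimize E(x) subject to g(x) = 0 by
# alternating an inner unconstrained minimization with a multiplier update.
# The quadratic E and linear g below are stand-ins for the nuclear energy
# functional and its constraints, assumed purely for illustration.
import numpy as np

E = lambda x: 0.5 * np.dot(x, x) - x[0]           # toy "energy"
grad_E = lambda x: x - np.array([1.0, 0.0])
g = lambda x: x[0] + x[1] - 2.0                   # toy constraint g(x) = 0
grad_g = lambda x: np.array([1.0, 1.0])

x, lam, c = np.zeros(2), 0.0, 10.0                # state, multiplier, penalty strength
for outer in range(20):
    for inner in range(200):                      # inner gradient-descent solve
        grad = grad_E(x) + (lam + c * g(x)) * grad_g(x)
        x -= 0.01 * grad
    lam += c * g(x)                               # multiplier update
print("x =", x, "constraint violation =", g(x))
```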