Parallelization of adaptive MC Integrators
Monte Carlo (MC) methods for numerical integration seem embarrassingly
parallel at first sight. When adaptive schemes are applied in order to enhance
convergence, however, the seemingly most natural way of replicating the whole
job on each processor can potentially ruin the adaptive behaviour. Using the
popular VEGAS algorithm as an example, an economical method of semi-micro
parallelization with variable grain size is presented and contrasted with
another straightforward approach of macro-parallelization. A portable
implementation of this semi-micro parallelization is used in the xloops project
and is made publicly available.
Comment: 10 pages, LaTeX2e, 1 pstricks figure included and 2 eps figures
inserted via epsfig. To appear in Comput. Phys. Commun.
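The pitfall described above can be sketched with a toy 1D integrator (an illustrative stand-in, not the paper's implementation: the bin-based sampling, the crude rebinning rule in `adapt`, and all constants are assumptions that only mimic VEGAS-style adaptation). Each "worker" samples against the same fixed grid, and the per-bin statistics are merged in a single reduction before the grid is refined, so the adaptive state never diverges between workers:

```python
import random

NBINS, WORKERS, SAMPLES, PASSES = 10, 4, 2000, 5

def f(x):
    return x * x  # toy integrand on [0, 1]; exact integral is 1/3

def worker_pass(grid, n, seed):
    """Evaluate n samples against a fixed grid; return per-bin weight sums
    and this worker's integral estimate. No grid mutation happens here."""
    rng = random.Random(seed)
    sums = [0.0] * NBINS
    total = 0.0
    for _ in range(n):
        b = rng.randrange(NBINS)                 # pick a bin uniformly
        lo, hi = grid[b], grid[b + 1]
        x = lo + rng.random() * (hi - lo)
        w = f(x) * (hi - lo) * NBINS             # importance weight
        sums[b] += w
        total += w
    return sums, total / n

def adapt(grid, sums):
    """Place new bin edges at equal shares of the accumulated weight,
    a crude stand-in for the VEGAS rebinning step."""
    total = sum(sums)
    new, acc, b = [grid[0]], 0.0, 0
    for k in range(1, NBINS):
        goal = k * total / NBINS
        while acc + sums[b] < goal:
            acc += sums[b]
            b += 1
        frac = (goal - acc) / max(sums[b], 1e-30)
        new.append(grid[b] + frac * (grid[b + 1] - grid[b]))
    new.append(grid[-1])
    return new

grid = [i / NBINS for i in range(NBINS + 1)]
for it in range(PASSES):
    # "parallel" part: each worker samples independently on the SAME grid
    results = [worker_pass(grid, SAMPLES // WORKERS, seed=100 * it + w)
               for w in range(WORKERS)]
    # reduction: merge per-bin statistics BEFORE adapting, so every
    # worker starts the next pass from the same refined grid
    sums = [sum(r[0][b] for r in results) for b in range(NBINS)]
    estimate = sum(r[1] for r in results) / WORKERS
    grid = adapt(grid, sums)

print(estimate)  # close to 1/3
```

Replicating the whole loop per processor instead, with each copy calling `adapt` on its own partial statistics, would let the grid copies drift apart, which is the failure mode the abstract warns about.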
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high number of bytes per lattice update in the variable-coefficient
case. Our thread-group concept provides a controllable trade-off between
concurrency and memory usage, shifting the pressure between the memory
interface and the CPU. We present performance results on a contemporary Intel
processor.
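The memory-traffic argument can be illustrated with a 1D skewed (parallelogram) temporal tiling, a simpler relative of the wavefront diamond scheme the abstract describes (the function names, the tile width `B`, and the three-point averaging stencil are illustrative assumptions, not the paper's scheme):

```python
def naive(u0, T):
    """Reference: sweep the whole array once per timestep, so the full
    array streams through the memory interface T times."""
    a = list(u0)
    for _ in range(T):
        b = list(a)
        for i in range(1, len(a) - 1):
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a = b
    return a

def skewed_tiled(u0, T, B=8):
    """Skewed temporal tiling: each B-cell tile is carried through all T
    timesteps while it is (conceptually) cache-resident. Tiles are skewed
    left by one cell per timestep so that, processing tiles left to right,
    every neighbour value a row needs has already been computed."""
    N = len(u0)
    buf = [list(u0), list(u0)]               # double buffer by time parity
    for tb in range(0, N + T, B):            # tiles, left to right
        for t in range(1, T + 1):            # all timesteps inside one tile
            src, dst = buf[(t - 1) % 2], buf[t % 2]
            lo = max(1, tb - t)              # row t covers [tb-t, tb-t+B)
            hi = min(N - 1, tb - t + B)
            for i in range(lo, hi):
                dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0
    return buf[T % 2]
```

In the naive sweep the whole array is loaded from memory once per timestep; in the tiled version each tile's working set is reused across all T timesteps, which is the reduction of traffic across the main memory interface that temporal blocking aims at.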
PORTA: A three-dimensional multilevel radiative transfer code for modeling the intensity and polarization of spectral lines with massively parallel computers
The interpretation of the intensity and polarization of the spectral line
radiation produced in the atmosphere of the Sun and of other stars requires
solving a radiative transfer problem that can be very complex, especially when
the main interest lies in modeling the spectral line polarization produced by
scattering processes and the Hanle and Zeeman effects. One of the difficulties
is that the plasma of a stellar atmosphere can be highly inhomogeneous and
dynamic, which implies the need to solve the non-equilibrium problem of the
generation and transfer of polarized radiation in realistic three-dimensional
(3D) stellar atmospheric models. Here we present PORTA, an efficient multilevel
radiative transfer code we have developed for the simulation of the spectral
line polarization caused by scattering processes and the Hanle and Zeeman
effects in 3D models of stellar atmospheres. The numerical method of solution
is based on the non-linear multigrid iterative method and on a novel
short-characteristics formal solver of the Stokes-vector transfer equation
which uses monotonic Bézier interpolation. Therefore, with PORTA the
computing time needed to obtain at each spatial grid point the self-consistent
values of the atomic density matrix (which quantifies the excitation state of
the atomic system) scales linearly with the total number of grid points.
Another crucial feature of PORTA is its parallelization strategy, which allows
us to speed up the numerical solution of complicated 3D problems by several
orders of magnitude with respect to sequential radiative transfer approaches,
given its excellent linear scaling with the number of available processors. The
PORTA code can also be conveniently applied to solve the simpler 3D radiative
transfer problem of unpolarized radiation in multilevel systems.
Comment: 15 pages, 15 figures, to appear in Astronomy and Astrophysics
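The monotonic Bézier interpolation mentioned above can be sketched for a single interval (a minimal, assumed form: the derivative argument `dy0` and the clamping rule are illustrative, not PORTA's actual formal solver). The idea is that clamping the quadratic Bézier control point between the endpoint values suppresses the overshoots that plain polynomial interpolation can produce:

```python
def monotonic_bezier(x0, y0, x1, y1, dy0):
    """Quadratic Bézier interpolant on [x0, x1] whose control point is
    clamped between the endpoint values, preventing overshoot."""
    h = x1 - x0
    c = y0 + 0.5 * h * dy0                       # unclamped control point
    c = min(max(c, min(y0, y1)), max(y0, y1))    # enforce monotonicity
    def interp(x):
        u = (x - x0) / h
        return (1 - u) ** 2 * y0 + 2 * u * (1 - u) * c + u ** 2 * y1
    return interp
```

With a steep incoming derivative (e.g. `dy0=5.0` between the points (0, 0) and (1, 1)), the unclamped control point would push the curve above 1; after clamping, all interpolated values stay inside the endpoint range.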
Construction and Application of an AMR Algorithm for Distributed Memory Computers
While the parallelization of block-structured adaptive mesh refinement (AMR) techniques is relatively straightforward on shared-memory architectures, appropriate distribution strategies for the emerging generation of distributed-memory machines are a topic of ongoing research. In this paper, a locality-preserving domain decomposition is proposed that partitions the entire AMR hierarchy from the base level on. It is shown that the approach reduces the communication costs and simplifies the implementation. Emphasis is put on the effective parallelization of the flux correction procedure at coarse-fine boundaries, which is indispensable for conservative finite volume schemes. An
easily reproducible standard benchmark and a highly resolved parallel AMR
simulation of a diffracting hydrogen-oxygen detonation demonstrate the proposed
strategy in practice.
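A locality-preserving assignment of the kind the abstract describes might look as follows in a 1D toy setting (all names and the contiguous-chunk rule are hypothetical; the paper's decomposition is more elaborate). The key property is that a refined patch always lands on the rank that owns the base-level region beneath it, so coarse-fine interactions such as the flux correction stay local:

```python
def owner_of(base_cell, n_base, n_ranks):
    """Locality-preserving rule: partition the base level into contiguous
    chunks of roughly equal size; one chunk per rank."""
    return base_cell * n_ranks // n_base

def assign_patches(patches, n_base, refinement, n_ranks):
    """patches: list of (level, start_index) in level-local coordinates.
    Map each patch down to its underlying base cell, then to that cell's
    rank, so a base cell and all refinements above it share a process."""
    assignment = {}
    for level, start in patches:
        base_cell = start // (refinement ** level)
        assignment[(level, start)] = owner_of(base_cell, n_base, n_ranks)
    return assignment
```

Because the whole refinement column above a base cell is co-located, the coarse-fine flux correction at that location needs no extra communication; only columns split across rank boundaries require message exchange.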
Parallel load balancing strategy for Volume-of-Fluid methods on 3-D unstructured meshes
© 2016. This version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ Volume-of-Fluid (VOF) is one of the methods of choice to reproduce the interface motion in the simulation of multi-fluid flows. One of its main strengths is its accuracy in capturing sharp interface geometries, although this requires a number of geometric calculations. Under these circumstances, achieving parallel performance on current supercomputers is a must. The main obstacle for the parallelization is that the computing costs are concentrated only in the discrete elements that lie on the interface between fluids. Consequently, if the interface is not homogeneously distributed throughout the domain, standard domain decomposition (DD) strategies lead to imbalanced workload distributions. In this paper, we present a new parallelization strategy for general unstructured VOF solvers, based on a dynamic load balancing process complementary to the underlying DD. Its parallel efficiency has been analyzed and compared to the DD one using up to 1024 CPU cores on an Intel Sandy Bridge based supercomputer. The results obtained on the solution of several artificially generated test cases show a speedup of up to ~12x with respect to the standard DD, depending on the interface size, the initial distribution and the number of parallel processes engaged. Moreover, the new parallelization strategy presented is of general purpose; therefore, it could be used to parallelize any VOF solver without requiring changes in the coupled flow solver. Finally, note that although designed for the VOF method, our approach could be easily adapted to other interface-capturing methods, such as the Level-Set, which may present similar workload imbalances. (C) 2014 Elsevier Inc. All rights reserved. Peer reviewed. Postprint (author's final draft).
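The dynamic load balancing idea, as opposed to its concrete implementation in the paper, can be sketched as a planning step over per-rank interface-cell counts (`rebalance_plan` and its greedy pairing of overloaded with underloaded ranks are assumptions for illustration):

```python
def rebalance_plan(loads):
    """Given per-rank interface-cell counts, pair overloaded ranks with
    underloaded ones and emit (src, dst, n_cells) transfers that bring
    every underloaded rank up to the (floored) average load."""
    n = len(loads)
    avg = sum(loads) // n        # integer target; any remainder stays put
    surplus = [(r, l - avg) for r, l in enumerate(loads) if l > avg]
    deficit = [(r, avg - l) for r, l in enumerate(loads) if l < avg]
    plan, i, j = [], 0, 0
    while i < len(surplus) and j < len(deficit):
        src, s = surplus[i]
        dst, d = deficit[j]
        n_move = min(s, d)                  # move as much as both sides allow
        plan.append((src, dst, n_move))
        s, d = s - n_move, d - n_move
        surplus[i], deficit[j] = (src, s), (dst, d)
        if s == 0:
            i += 1
        if d == 0:
            j += 1
    return plan
```

For example, with loads `[100, 0, 0, 0]` the plan sends 25 cells from rank 0 to each of ranks 1, 2 and 3. In a real solver each transfer would ship the geometric reconstruction work of those interface cells while the underlying DD of the flow solver stays untouched, which is what makes such a scheme complementary to the existing decomposition.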