Multi-core computation of transfer matrices for strip lattices in the Potts model
The transfer-matrix technique is a convenient way to study strip lattices
in the Potts model, since the computational cost depends only on the periodic
part of the lattice and not on the whole lattice. However, even with this cost
reduction, the transfer-matrix computation is still an NP-hard problem, since
the time T(|V|, |E|) needed to compute the matrix grows exponentially as a
function of the graph width. In this work, we present a parallel
transfer-matrix implementation that scales performance on multi-core
architectures. The construction of the matrix is based on repeated application
of the deletion-contraction technique, which exposes parallelism well suited
to multi-core machines. Our experimental results show that the multi-core
implementation achieves speedups of 3.7X with p = 4 processors and 5.7X with
p = 8. The efficiency of the implementation lies between 60% and 95%, with the
best balance of speedup and efficiency at p = 4 processors on current
multi-core architectures. The algorithm also takes advantage of the lattice
symmetry, making the transfer-matrix computation run up to 2X faster than its
non-symmetric counterpart while using as little as a quarter of the original
space.
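The deletion-contraction recursion at the heart of this construction is easy to illustrate. Below is a minimal sketch (our own, not the authors' implementation) of the Fortuin-Kasteleyn form of the Potts partition function, Z(G) = Z(G - e) + v Z(G / e), with OpenMP tasks spawned near the root of the recursion tree; the actual transfer-matrix construction, strip periodicity, and symmetry exploitation are omitted. Compile with -fopenmp.

```cpp
// Deletion-contraction for the Potts partition function in Fortuin-Kasteleyn
// form: Z(G) = Z(G - e) + v * Z(G / e); a self-loop contributes (1 + v);
// with no edges left, Z = q^n (one factor q per remaining vertex).
#include <cstdio>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

static double potts(std::vector<Edge> edges, int n, double q, double v,
                    int depth) {
    if (edges.empty()) {
        double z = 1.0;
        for (int i = 0; i < n; ++i) z *= q;
        return z;
    }
    Edge e = edges.back();
    edges.pop_back();
    if (e.first == e.second)                     // self-loop
        return (1.0 + v) * potts(std::move(edges), n, q, v, depth);

    std::vector<Edge> con;                       // G / e: merge b into a
    con.reserve(edges.size());
    const int a = e.first, b = e.second;
    for (Edge f : edges) {
        if (f.first == b)  f.first = a;
        if (f.second == b) f.second = a;
        con.push_back(f);
    }

    double zdel = 0.0, zcon = 0.0;
    if (depth < 4) {                             // spawn tasks near the root only
        #pragma omp task shared(zdel)
        zdel = potts(edges, n, q, v, depth + 1); // G - e
        zcon = potts(std::move(con), n - 1, q, v, depth + 1);
        #pragma omp taskwait
    } else {
        zdel = potts(std::move(edges), n, q, v, depth + 1);
        zcon = potts(std::move(con), n - 1, q, v, depth + 1);
    }
    return zdel + v * zcon;
}

int main() {
    // 2x3 strip of the square lattice: 6 vertices, 7 edges.
    std::vector<Edge> g = {{0,1},{1,2},{3,4},{4,5},{0,3},{1,4},{2,5}};
    double z = 0.0;
    #pragma omp parallel
    #pragma omp single
    z = potts(g, 6, /*q=*/3.0, /*v=*/1.0, 0);
    std::printf("Z(q=3, v=1) = %.0f\n", z);
}
```

Spawning tasks only at shallow recursion depth keeps scheduling overhead bounded while still exposing up to 2^depth independent subproblems to the thread team.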
Efficient Parallelization of Short-Range Molecular Dynamics Simulations on Many-Core Systems
This article introduces a highly parallel algorithm for molecular dynamics
simulations with short-range forces on single-node multi- and many-core
systems. The algorithm is designed to achieve high parallel speedups for
strongly inhomogeneous systems such as nanodevices or nanostructured
materials. In the proposed scheme, the calculation of the forces and the
generation of neighbor lists are divided into small tasks. The tasks are then
executed by a thread pool according to a dependent task schedule, constructed
in such a way that a particle is never accessed by two threads at the same
time. Benchmark simulations on a typical 12-core machine show that the
described algorithm achieves excellent parallel efficiencies above 80% for
different kinds of systems and all numbers of cores. For inhomogeneous systems
the speedups are strongly superior to those obtained with spatial
decomposition. Further benchmarks were performed on an Intel Xeon Phi
coprocessor; these simulations demonstrate that the algorithm scales well to
large numbers of cores. Comment: 12 pages, 8 figures
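The core scheduling idea, that no particle is ever accessed by two threads at the same time, can be shown in miniature. The sketch below (our own illustration with a toy force and hypothetical cell sizes; the article itself uses a thread pool executing a dependent task graph rather than OpenMP phases) partitions particles into cells and runs cell-pair tasks in conflict-free phases:

```cpp
// Race-free short-range force tasks in 1D: intra-cell tasks are mutually
// independent, and cell-pair tasks (i, i+1) are split into an even-i and an
// odd-i phase so that concurrently running tasks own disjoint cells.
#include <cmath>
#include <cstdio>
#include <vector>

struct Particle { double x, f; };
using Cell = std::vector<Particle>;

// Toy pair force with cutoff rc; Newton's third law updates both particles.
static void interact(Particle& p, Particle& q, double rc) {
    double d = q.x - p.x;
    if (std::fabs(d) < rc) { p.f += d; q.f -= d; }
}

static void intra(Cell& c, double rc) {
    for (size_t i = 0; i < c.size(); ++i)
        for (size_t j = i + 1; j < c.size(); ++j) interact(c[i], c[j], rc);
}

static void inter(Cell& a, Cell& b, double rc) {
    for (auto& p : a) for (auto& q : b) interact(p, q, rc);
}

int main() {
    const double rc = 0.5;
    std::vector<Cell> cells(8);
    for (int c = 0; c < 8; ++c)                  // a few particles per cell
        for (int k = 0; k < 3; ++k) cells[c].push_back({c + 0.3 * k, 0.0});

    // Phase 1: intra-cell tasks; each touches exactly one cell.
    #pragma omp parallel for
    for (int c = 0; c < 8; ++c) intra(cells[c], rc);

    // Phases 2 and 3: pair tasks (i, i+1); even and odd i never overlap.
    for (int phase = 0; phase < 2; ++phase) {
        #pragma omp parallel for
        for (int i = phase; i < 7; i += 2) inter(cells[i], cells[i + 1], rc);
    }

    double fsum = 0.0;                           // forces sum to ~0 (Newton III)
    for (auto& c : cells) for (auto& p : c) fsum += p.f;
    std::printf("net force = %.3e\n", fsum);
}
```

Because tasks within one phase own disjoint cells, the Newton's-third-law updates need no locks or atomics.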
Improved parallelization techniques for the density matrix renormalization group
A distributed-memory parallelization strategy for the density matrix
renormalization group is proposed for cases where correlation functions are
required. This new strategy offers substantial improvements over previous
work. A scalability analysis shows an overall serial fraction of 9.4% and an
efficiency of around 60% on up to eight nodes. Sources of possible parallel
slowdown are pointed out, and solutions to circumvent these issues are brought
forward to achieve better performance. Comment: 8 pages, 4 figures; version
published in Computer Physics Communications
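As a quick consistency check (our arithmetic, not a claim from the paper), the reported serial fraction and efficiency agree under Amdahl's law:

\[
S(p) = \frac{1}{s + (1 - s)/p}, \qquad
S(8) = \frac{1}{0.094 + 0.906/8} \approx 4.83, \qquad
E(8) = \frac{S(8)}{8} \approx 0.60,
\]

i.e. roughly the quoted 60% efficiency on eight nodes.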
SKIRT: hybrid parallelization of radiative transfer simulations
We describe the design, implementation and performance of the new hybrid
parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which
has been used extensively for modeling the continuum radiation of dusty
astrophysical systems including late-type galaxies and dusty tori. The hybrid
scheme combines distributed memory parallelization, using the standard Message
Passing Interface (MPI) to communicate between processes, and shared memory
parallelization, providing multiple execution threads within each process to
avoid duplication of data structures. The synchronization between multiple
threads is accomplished through atomic operations without high-level locking
(also called lock-free programming). This improves the scaling behavior of the
code and substantially simplifies the implementation of the hybrid scheme. The
result is an extremely flexible solution that adjusts to the number of
available nodes, processors and memory, and consequently performs well on a
wide variety of computing architectures. Comment: 21 pages, 20 figures
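The lock-free programming mentioned here follows a standard pattern: threads tally into shared arrays through atomic compare-and-swap rather than mutexes. Below is a generic sketch (not SKIRT's actual code; the grid size, luminosity distribution, and the atomic_add helper are invented for the example):

```cpp
// Lock-free accumulation of Monte Carlo tallies into a shared grid using a
// compare-and-swap loop on std::atomic<double>, so threads within a process
// share one data structure without high-level locking.
#include <atomic>
#include <cstdio>
#include <random>
#include <thread>
#include <vector>

// Atomic a += x via CAS (C++20 adds fetch_add for floating-point atomics).
static void atomic_add(std::atomic<double>& a, double x) {
    double old = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(old, old + x,
                                    std::memory_order_relaxed))
        ;  // on failure, 'old' is refreshed; retry with the new value
}

int main() {
    const int ncells = 64, nthreads = 4, npackets = 100000;
    std::vector<std::atomic<double>> absorbed(ncells);
    for (auto& c : absorbed) c.store(0.0);

    auto worker = [&](int tid) {
        std::mt19937 rng(tid);                   // per-thread RNG stream
        std::uniform_int_distribution<int> cell(0, ncells - 1);
        std::exponential_distribution<double> lum(1.0);
        for (int i = 0; i < npackets; ++i)
            atomic_add(absorbed[cell(rng)], lum(rng));  // tally one packet
    };
    std::vector<std::thread> pool;
    for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();

    double total = 0.0;
    for (auto& c : absorbed) total += c.load();
    std::printf("total absorbed luminosity = %.3f\n", total);
}
```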
A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation
We implemented a fast Reciprocal Monte Carlo algorithm that accurately solves
radiative heat transfer in turbulent flows of non-grey participating media and
can be coupled to fully resolved turbulent flows, namely to Direct Numerical
Simulation (DNS). The spectrally varying absorption coefficient is treated in
a narrow-band fashion with a correlated-k distribution. The implementation is
verified against analytical solutions and validated against results from the
literature and line-by-line Monte Carlo computations. The method is
implemented on GPU with careful attention to memory transfer and computational
efficiency. The bottlenecks that dominate the computational expense are
identified, and several techniques are proposed to optimize the GPU execution.
With the proposed algorithmic accelerations, a speed-up of up to 3 orders of
magnitude is achieved while maintaining the same accuracy.
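To give a flavor of the kind of kernel involved (this is a generic forward ray march with Beer-Lambert attenuation, not the paper's reciprocal estimator or its correlated-k treatment; the absorption table is hypothetical), each ray accumulates optical depth cell by cell; on a GPU each ray would map to one thread, with the band table kept resident in device memory to avoid repeated host-device transfers:

```cpp
// Forward ray march through a 1D participating medium: accumulate the
// optical depth tau per narrow band, then attenuate via Beer-Lambert.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int ncells = 100, nbands = 4;
    const double ds = 0.01;                      // cell size [m]
    // Hypothetical narrow-band absorption coefficients [1/m].
    std::vector<std::vector<double>> kappa(
        nbands, std::vector<double>(ncells, 5.0));

    const double I0 = 1.0;                       // emitted intensity per band
    for (int b = 0; b < nbands; ++b) {
        double tau = 0.0;
        for (int c = 0; c < ncells; ++c)         // accumulate optical depth
            tau += kappa[b][c] * ds;
        double I = I0 * std::exp(-tau);          // Beer-Lambert attenuation
        std::printf("band %d: transmitted I = %.4f\n", b, I);
    }
}
```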
A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence
A hybrid scheme that utilizes MPI for distributed memory parallelism and
OpenMP for shared memory parallelism is presented. The work is motivated by the
desire to achieve exceptionally high Reynolds numbers in pseudospectral
computations of fluid turbulence on emerging petascale, high core-count,
massively parallel processing systems. The hybrid implementation derives from
and augments a well-tested scalable MPI-parallelized pseudospectral code. The
hybrid paradigm leads to a new picture for the domain decomposition of the
pseudospectral grids, which is helpful in understanding, among other things,
the 3D transpose of the global data that is necessary for the parallel fast
Fourier transforms that are the central component of the numerical
discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the
hybrid scheme achieves near-ideal scalability up to ~20,000 compute cores,
with a maximum mean efficiency of 83%. Data are presented that demonstrate how
to choose the optimal number of MPI processes and OpenMP threads to optimize
code performance on two different platforms. Comment: Submitted to Parallel
Computing
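The hybrid pattern itself reduces to a familiar skeleton: one MPI process per node (or socket), OpenMP threads within it, and a global transpose between FFT stages. A minimal sketch under those assumptions (our own; buffer sizes and the stand-in loop body are illustrative, and production pseudospectral codes perform the 3D transpose with MPI_Alltoall or MPI_Alltoallv over pencil communicators together with a threaded FFT library):

```cpp
// Hybrid MPI-OpenMP skeleton: threaded local slab work, then a global
// transpose via MPI_Alltoall so the next FFT direction becomes local.
// Build with an MPI compiler wrapper, e.g. mpic++ -fopenmp.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    int provided;
    // FUNNELED: only the master thread makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int chunk = 1024;                      // words per process pair
    std::vector<double> send(chunk * nprocs), recv(chunk * nprocs);

    // Threaded local work on this process's slab (stand-in for 1D FFTs).
    #pragma omp parallel for
    for (int i = 0; i < chunk * nprocs; ++i)
        send[i] = rank + 1e-3 * i;

    // Global transpose: each process exchanges one block with every other,
    // redistributing the grid across the communicator.
    MPI_Alltoall(send.data(), chunk, MPI_DOUBLE,
                 recv.data(), chunk, MPI_DOUBLE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```

MPI_THREAD_FUNNELED suffices here because all communication is funneled through the master thread; the communication-free local work between transposes is where the OpenMP threads contribute.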