
    Multi-core computation of transfer matrices for strip lattices in the Potts model

    The transfer-matrix technique is a convenient way to study strip lattices in the Potts model, since the computational cost depends only on the periodic part of the lattice and not on the whole. However, even with this reduction, the transfer-matrix technique remains an NP-hard problem: the time T(|V|, |E|) needed to compute the matrix grows exponentially as a function of the graph width. In this work, we present a parallel transfer-matrix implementation that scales performance on multi-core architectures. The construction of the matrix is based on repeated applications of the deletion-contraction technique, exposing parallelism well suited to multi-core machines. Our experimental results show that the multi-core implementation achieves speedups of 3.7X with p = 4 processors and 5.7X with p = 8. The efficiency of the implementation lies between 60% and 95%, achieving the best balance of speedup and efficiency at p = 4 processors on current multi-core architectures. The algorithm also takes advantage of the lattice symmetry, allowing the transfer-matrix computation to run up to 2X faster than its non-symmetric counterpart and to use as little as a quarter of the original space.
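
    The recurrence the authors parallelize is the standard deletion-contraction identity for the Potts partition function, Z(G; q, v) = Z(G - e; q, v) + v Z(G/e; q, v) for any non-loop edge e, with Z = q^n on an edgeless graph with n vertices. Below is a minimal sketch of that recursion with OpenMP task parallelism; it is not the authors' implementation, and the edge-list representation and task cutoff are illustrative only.

    // Minimal sketch (not the paper's code): deletion-contraction recursion
    // for the Potts partition function Z(G; q, v), with OpenMP tasks.
    #include <omp.h>
    #include <cmath>
    #include <cstdio>
    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;

    // Z(G) = Z(G - e) + v * Z(G / e); base case Z = q^(#vertices).
    double potts_z(std::vector<Edge> edges, int n_vertices, double q, double v) {
        // Self-loops (created by earlier contractions) each contribute (1 + v).
        double loop_factor = 1.0;
        std::vector<Edge> simple;
        for (const Edge& e : edges)
            if (e.first == e.second) loop_factor *= 1.0 + v;
            else simple.push_back(e);
        if (simple.empty())
            return loop_factor * std::pow(q, n_vertices);

        Edge e = simple.back();
        simple.pop_back();

        std::vector<Edge> del = simple;   // G - e
        std::vector<Edge> con = simple;   // G / e: merge e.second into e.first
        for (Edge& f : con) {
            if (f.first == e.second)  f.first = e.first;
            if (f.second == e.second) f.second = e.first;
        }

        double z_del = 0.0, z_con = 0.0;
        #pragma omp task shared(z_del) if(simple.size() > 8)  // illustrative cutoff
        z_del = potts_z(del, n_vertices, q, v);
        z_con = potts_z(con, n_vertices - 1, q, v);
        #pragma omp taskwait
        return loop_factor * (z_del + v * z_con);
    }

    int main() {
        std::vector<Edge> square = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};  // 4-cycle
        double z = 0.0;
        #pragma omp parallel
        #pragma omp single
        z = potts_z(square, 4, 3.0, 1.0);  // q = 3 Potts, v = e^{beta J} - 1 = 1
        std::printf("Z = %g\n", z);        // expected: 258 for this graph
        return 0;
    }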

    Efficient Parallelization of Short-Range Molecular Dynamics Simulations on Many-Core Systems

    This article introduces a highly parallel algorithm for molecular dynamics simulations with short-range forces on single-node multi- and many-core systems. The algorithm is designed to achieve high parallel speedups for strongly inhomogeneous systems like nanodevices or nanostructured materials. In the proposed scheme, the calculation of the forces and the generation of neighbor lists are divided into small tasks. The tasks are then executed by a thread pool according to a dependent task schedule, constructed in such a way that a particle is never accessed by two threads at the same time. Benchmark simulations on a typical 12-core machine show that the described algorithm achieves excellent parallel efficiencies above 80% for different kinds of systems and all numbers of cores. For inhomogeneous systems the speedups are strongly superior to those obtained with spatial decomposition. Further benchmarks were performed on an Intel Xeon Phi coprocessor; these simulations demonstrate that the algorithm scales well to large numbers of cores. Comment: 12 pages, 8 figures
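
    A standard way to guarantee the key property described above, that no particle is ever touched by two threads at once, is to color the cells of the neighbor-search grid so that concurrently processed cells have disjoint write regions. The sketch below is not the paper's dependent task schedule; it is a simplified 2D illustration in which a 3x3 coloring keeps concurrent iterations apart even when Newton's third law is exploited across cell borders. The CellGrid layout and the force callback are assumptions for the example.

    // Simplified sketch (not the paper's scheduler): 3x3 cell coloring in 2D.
    #include <vector>

    struct Vec2 { double x = 0.0, y = 0.0; };

    struct CellGrid {
        int nx = 0, ny = 0;
        std::vector<std::vector<int>> cells;  // particle indices per cell
        int at(int cx, int cy) const { return cy * nx + cx; }
    };

    // force(i, j, f) is assumed to add the pair force to f[i] and subtract
    // it from f[j] (Newton's third law).
    void compute_forces(const CellGrid& g, std::vector<Vec2>& f,
                        void (*force)(int, int, std::vector<Vec2>&)) {
        for (int color = 0; color < 9; ++color) {  // 3 x 3 = 9 colors
            int ox = color % 3, oy = color / 3;
            // Same-color cells are 3 apart, so the write regions (a cell plus
            // its 8 neighbors) of concurrently running iterations are disjoint.
            #pragma omp parallel for collapse(2) schedule(dynamic)
            for (int cy = oy; cy < g.ny; cy += 3)
                for (int cx = ox; cx < g.nx; cx += 3)
                    for (int i : g.cells[g.at(cx, cy)])
                        for (int dy = -1; dy <= 1; ++dy)
                            for (int dx = -1; dx <= 1; ++dx) {
                                int ncx = cx + dx, ncy = cy + dy;
                                if (ncx < 0 || ncy < 0 || ncx >= g.nx || ncy >= g.ny)
                                    continue;
                                for (int j : g.cells[g.at(ncx, ncy)])
                                    if (j > i) force(i, j, f);  // each pair once
                            }
        }
    }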

    Improved parallelization techniques for the density matrix renormalization group

    A distributed-memory parallelization strategy for the density matrix renormalization group is proposed for cases where correlation functions are required. This new strategy offers substantial improvements over previous work. A scalability analysis shows an overall serial fraction of 9.4% and an efficiency of around 60% on up to eight nodes. Sources of possible parallel slowdown are pointed out, and solutions to circumvent these issues are brought forward in order to achieve better performance. Comment: 8 pages, 4 figures; version published in Computer Physics Communications
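
    The quoted figures are mutually consistent under Amdahl's law. A quick check (a math sketch, taking the serial fraction f = 0.094 from the scalability analysis):

    \[
      S(p) = \frac{1}{f + (1 - f)/p}, \qquad
      S(8) = \frac{1}{0.094 + 0.906/8} \approx 4.83, \qquad
      E(8) = \frac{S(8)}{8} \approx 0.60,
    \]

    i.e. roughly 60% efficiency on eight nodes, as reported.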

    SKIRT: hybrid parallelization of radiative transfer simulations

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modeling the continuum radiation of dusty astrophysical systems, including late-type galaxies and dusty tori. The hybrid scheme combines distributed-memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, with shared-memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. Synchronization between threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behavior of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures. Comment: 21 pages, 20 figures
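
    The combination described above, MPI between processes plus lock-free shared-memory threading inside each process, follows a common pattern: work items are distributed over threads, and threads accumulate into one shared set of tallies with atomic operations instead of locks or per-thread copies. The sketch below illustrates that pattern only; it is not SKIRT's code, and the photon loop, cell count and tally layout are placeholders.

    // Illustrative sketch (not SKIRT itself): hybrid MPI + threads with
    // lock-free atomic accumulation into shared tallies.
    #include <mpi.h>
    #include <atomic>
    #include <thread>
    #include <vector>

    // Atomic add for double via compare-exchange; fetch_add on
    // std::atomic<double> is only standard since C++20.
    void atomic_add(std::atomic<double>& a, double v) {
        double old = a.load(std::memory_order_relaxed);
        while (!a.compare_exchange_weak(old, old + v, std::memory_order_relaxed)) {}
    }

    int main(int argc, char** argv) {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_cells = 1024;        // placeholder grid size
        const long n_photons = 1000000;  // placeholder package count
        std::vector<std::atomic<double>> tally(n_cells);  // one shared copy
        for (auto& t : tally) t.store(0.0);

        long my_photons = n_photons / size;  // photons split across MPI ranks
        unsigned n_threads = std::thread::hardware_concurrency();
        if (n_threads == 0) n_threads = 4;
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < n_threads; ++t)
            pool.emplace_back([&, t] {
                for (long p = t; p < my_photons; p += n_threads) {
                    // ... trace photon package p, depositing energy in cells ...
                    int cell = static_cast<int>(p % n_cells);  // placeholder path
                    atomic_add(tally[cell], 1.0);              // lock-free update
                }
            });
        for (auto& th : pool) th.join();

        // Combine the per-process tallies across MPI ranks.
        std::vector<double> local(n_cells), global(n_cells);
        for (int c = 0; c < n_cells; ++c) local[c] = tally[c].load();
        MPI_Reduce(local.data(), global.data(), n_cells, MPI_DOUBLE,
                   MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }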

    A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation

    We implemented a fast Reciprocal Monte Carlo algorithm to accurately solve radiative heat transfer in turbulent flows of non-grey participating media, which can be coupled to fully resolved turbulent flows, namely Direct Numerical Simulation (DNS). The spectrally varying absorption coefficient is treated in a narrow-band fashion with a correlated-k distribution. The implementation is verified against analytical solutions and validated with results from the literature and line-by-line Monte Carlo computations. The method is implemented on GPU with careful attention to memory transfer and computational efficiency. The bottlenecks that dominate the computational expenses are addressed, and several techniques are proposed to optimize the GPU execution. With the proposed algorithmic accelerations, a speed-up of up to three orders of magnitude can be achieved while maintaining the same accuracy.
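
    The correlated-k treatment mentioned above replaces the rapidly varying absorption coefficient within each narrow band by its reordered, monotonic distribution k(g), g in [0, 1]; a Monte Carlo ray samples g once and keeps it fixed along the whole path, which is what makes the treatment "correlated". A small CPU-side sketch of that idea follows; the band data structure and interpolation are assumptions, and this is not the paper's GPU implementation.

    // Sketch of the correlated-k idea: per narrow band, k is tabulated at
    // quadrature points of its cumulative distribution g.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct NarrowBand {
        std::vector<double> g;  // cumulative-distribution abscissae in [0, 1]
        std::vector<double> k;  // reordered absorption coefficients [1/m]
    };

    // Linear interpolation of k(g) within one band.
    double k_of_g(const NarrowBand& b, double g) {
        auto it = std::lower_bound(b.g.begin(), b.g.end(), g);
        if (it == b.g.begin()) return b.k.front();
        if (it == b.g.end())   return b.k.back();
        std::size_t i = it - b.g.begin();
        double w = (g - b.g[i - 1]) / (b.g[i] - b.g[i - 1]);
        return (1.0 - w) * b.k[i - 1] + w * b.k[i];
    }

    // Transmissivity along a ray crossing equal segments of length ds; the
    // same g is reused for every segment (the "correlated" assumption).
    double path_transmissivity(const std::vector<NarrowBand>& bands_along_path,
                               double ds, double g) {
        double tau = 0.0;
        for (const NarrowBand& b : bands_along_path) tau += k_of_g(b, g) * ds;
        return std::exp(-tau);
    }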

    A hybrid MPI-OpenMP scheme for scalable parallel pseudospectral computations for fluid turbulence

    A hybrid scheme that utilizes MPI for distributed-memory parallelism and OpenMP for shared-memory parallelism is presented. The work is motivated by the desire to achieve exceptionally high Reynolds numbers in pseudospectral computations of fluid turbulence on emerging petascale, high core-count, massively parallel processing systems. The hybrid implementation derives from, and augments, a well-tested scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new picture for the domain decomposition of the pseudospectral grids, which is helpful in understanding, among other things, the 3D transpose of the global data that is necessary for the parallel fast Fourier transforms at the core of the numerical discretization. Details of the hybrid implementation are provided, and performance tests illustrate the utility of the method. It is shown that the hybrid scheme achieves near-ideal scalability up to ~20,000 compute cores with a maximum mean efficiency of 83%. Data are presented that demonstrate how to choose the optimal number of MPI processes and OpenMP threads in order to optimize code performance on two different platforms. Comment: Submitted to Parallel Computing
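
    The global transpose mentioned above is the communication kernel of any distributed pseudospectral code: data must be reshuffled between ranks so that each 1D FFT is taken along an axis that is local in memory. A hedged 2D sketch of that transpose with MPI_Alltoall follows; the layout conventions are assumptions, a 3D code applies the same pattern once per distributed direction, and this is not the paper's implementation.

    // Sketch of the global transpose behind distributed pseudospectral FFTs.
    // Before: each rank owns nx/P consecutive x-rows of an nx-by-ny array.
    // After : each rank owns ny/P consecutive y-rows of the transposed array.
    #include <mpi.h>
    #include <complex>
    #include <vector>

    using cplx = std::complex<double>;

    std::vector<cplx> transpose_2d(const std::vector<cplx>& in,
                                   int nx, int ny, MPI_Comm comm) {
        int P = 1, rank = 0;
        MPI_Comm_size(comm, &P);
        MPI_Comm_rank(comm, &rank);
        const int nx_loc = nx / P, ny_loc = ny / P;  // assume P divides nx, ny
        const int block = nx_loc * ny_loc;

        // Pack: block r holds my rows restricted to rank r's columns.
        std::vector<cplx> send(static_cast<std::size_t>(block) * P), recv(send.size());
        for (int r = 0; r < P; ++r)
            for (int i = 0; i < nx_loc; ++i)
                for (int j = 0; j < ny_loc; ++j)
                    send[r * block + i * ny_loc + j] = in[i * ny + r * ny_loc + j];

        MPI_Alltoall(send.data(), block, MPI_CXX_DOUBLE_COMPLEX,
                     recv.data(), block, MPI_CXX_DOUBLE_COMPLEX, comm);

        // Unpack: block r now holds rank r's rows restricted to my columns.
        std::vector<cplx> out(static_cast<std::size_t>(ny_loc) * nx);
        for (int r = 0; r < P; ++r)
            for (int i = 0; i < nx_loc; ++i)
                for (int j = 0; j < ny_loc; ++j)
                    out[j * nx + r * nx_loc + i] = recv[r * block + i * ny_loc + j];
        return out;
    }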