
    A Parallel Adaptive P3M code with Hierarchical Particle Reordering

    We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being the hierarchical reordering of particles within chaining cells, which greatly improves data locality and thereby removes the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications.
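
    A minimal sketch (Python/NumPy, not the paper's Fortran77+OpenMP code; the function and parameter names are assumptions) of the reordering idea: sorting particles by chaining-cell index stores each cell's particles contiguously, so neighbour loops traverse array slices instead of chasing linked-list pointers.

        import numpy as np

        def reorder_by_cell(pos, box_size, n_cells):
            """Sort particles by the chaining cell that contains them (cache-friendly order)."""
            cell_width = box_size / n_cells
            ijk = np.floor(pos / cell_width).astype(int) % n_cells            # integer cell coordinates
            cell_id = (ijk[:, 0] * n_cells + ijk[:, 1]) * n_cells + ijk[:, 2]
            order = np.argsort(cell_id, kind="stable")                        # reorder particles cell by cell
            pos_sorted, id_sorted = pos[order], cell_id[order]
            # First particle index of each cell; replaces the head-of-chain pointers.
            starts = np.searchsorted(id_sorted, np.arange(n_cells**3 + 1))
            return pos_sorted, starts

        # Usage: the particles of cell c are the contiguous slice pos_sorted[starts[c]:starts[c+1]].
        rng = np.random.default_rng(0)
        pos_sorted, starts = reorder_by_cell(rng.random((10_000, 3)) * 100.0, box_size=100.0, n_cells=8)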

    A fast GPU Monte Carlo Radiative Heat Transfer Implementation for Coupling with Direct Numerical Simulation

    We implemented a fast reciprocal Monte Carlo algorithm to accurately solve radiative heat transfer in turbulent flows of non-grey participating media, which can be coupled to fully resolved turbulent flows, namely to Direct Numerical Simulation (DNS). The spectrally varying absorption coefficient is treated in a narrow-band fashion with a correlated-k distribution. The implementation is verified against analytical solutions and validated against results from the literature and line-by-line Monte Carlo computations. The method is implemented on the GPU with careful attention to memory transfer and computational efficiency. The bottlenecks that dominate the computational expense are addressed, and several techniques are proposed to optimize the GPU execution. By implementing the proposed algorithmic accelerations, a speed-up of up to three orders of magnitude can be achieved while maintaining the same accuracy.
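
    A heavily simplified sketch (Python/NumPy, grey gas, 1-D slab, forward rather than reciprocal sampling; not the paper's correlated-k GPU implementation, and all names are assumptions) of the kind of Monte Carlo kernel being accelerated: rays are emitted from each cell, traced to their absorption point, and the emitted-minus-absorbed balance gives the radiative source term.

        import numpy as np

        SIGMA = 5.670374419e-8                                   # Stefan-Boltzmann constant [W m^-2 K^-4]

        def radiative_source(T, kappa, dx, n_rays=20_000, rng=None):
            """Net radiative source term per cell of a 1-D grey slab, by forward Monte Carlo."""
            rng = rng or np.random.default_rng(1)
            n = T.size
            absorbed = np.zeros(n)
            emitted = 4.0 * kappa * SIGMA * T**4 * dx            # power emitted by each cell (per unit area)
            for i in range(n):                                   # loop over emitting cells
                x0 = (i + rng.random(n_rays)) * dx               # emission points inside cell i
                mu = rng.uniform(-1.0, 1.0, n_rays)              # isotropic direction cosines
                s = -np.log(1.0 - rng.random(n_rays)) / kappa    # path length to absorption
                j = np.floor((x0 + mu * s) / dx).astype(int)     # absorbing cell index
                inside = (j >= 0) & (j < n)                      # rays leaving the slab are lost
                np.add.at(absorbed, j[inside], emitted[i] / n_rays)
            return (absorbed - emitted) / dx                     # net source term [W m^-3]

        # Example: hot central region in a cold slab.
        T = np.full(64, 300.0); T[24:40] = 1500.0
        print(radiative_source(T, kappa=5.0, dx=0.01))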

    An Efficient Sliding Mesh Interface Method for High-Order Discontinuous Galerkin Schemes

    Sliding meshes are a powerful method for treating deforming domains in computational fluid dynamics, where different parts of the domain are in relative motion. In this paper, we present an efficient implementation of a sliding mesh method in a discontinuous Galerkin compressible Navier-Stokes solver and its application to a large eddy simulation of a 1-1/2 stage turbine. The method is based on the mortar method and is high-order accurate. It can handle three-dimensional sliding mesh interfaces with various interface shapes. For plane interfaces, which are the most common case, conservativity and free-stream preservation are ensured. We put an emphasis on an efficient parallel implementation. Our implementation generates little computational and storage overhead. Inter-node communication via MPI in a dynamically changing mesh topology is reduced to a bare minimum by ensuring a priori knowledge of communication partners and by data sorting. We provide performance and scaling results showing the capability of the implementation strategy. Apart from analytical validation computations and convergence results, we present a wall-resolved implicit LES of the 1-1/2 stage Aachen turbine test case as a large-scale practical application example.
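
    To illustrate the mortar coupling at the heart of such a method (an illustrative Python sketch under simplifying assumptions: a 1-D interface trace, Legendre modal data, hypothetical segment bounds; not the solver's actual implementation), donor-face data is L2-projected onto each mortar segment with Gauss quadrature, which is what keeps the non-conforming interface high-order and conservative.

        import numpy as np
        from numpy.polynomial.legendre import leggauss, legval

        def project_to_mortar(trace, a, b, degree):
            """L2-project a donor-face trace onto Legendre modes of the mortar segment [a, b]."""
            xg, wg = leggauss(degree + 1)                    # Gauss points/weights on [-1, 1]
            xm = 0.5 * (a + b) + 0.5 * (b - a) * xg          # quadrature points mapped to the segment
            coeffs = []
            for k in range(degree + 1):
                pk = legval(xg, [0.0] * k + [1.0])           # k-th Legendre polynomial
                coeffs.append(np.sum(wg * trace(xm) * pk) * (2 * k + 1) / 2.0)
            return np.array(coeffs)                          # modal data living on the mortar

        # A sliding neighbour splits the donor face [0, 1] into two mortar segments
        # (hypothetical bounds); fluxes would be evaluated per segment and projected back.
        trace = lambda x: 1.0 + 2.0 * x                      # linear trace on the donor face
        for a, b in [(0.0, 0.4), (0.4, 1.0)]:
            print(project_to_mortar(trace, a, b, degree=2))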

    Hypercube algorithms on mesh connected multicomputers

    A new methodology named CALMANT (CC-cube Algorithms on Meshes and Tori) is proposed for mapping a type of algorithm that we call CC-cube algorithms onto multicomputers with hypercube, mesh, or torus interconnection topologies. This methodology is suitable when the initial problem can be expressed as a set of processes that communicate through a hypercube topology (a CC-cube algorithm). Many important algorithms fit the CC-cube type. CALMANT is based on three different techniques: (a) standard embedding to assign the processes of the algorithm to the nodes of the mesh multicomputer; (b) communication pipelining to increase the level of communication parallelism inherent in CC-cube algorithms; and (c) optimal message-scheduling algorithms, proposed in this work, that avoid conflicts and thereby minimize communication time. Although CALMANT is proposed for multicomputers with different interconnection network topologies, the paper focuses only on the particular case of meshes.
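
    As an illustration of the standard-embedding step (a Python sketch with assumed names, not code from the paper): a hypercube process label is split into row and column halves and mapped through the inverse binary-reflected Gray code, so that every hypercube edge of the CC-cube algorithm becomes a communication along a single mesh row or column, which the pipelining and message-scheduling stages can then exploit.

        def gray_to_binary(g):
            """Inverse binary-reflected Gray code."""
            b = 0
            while g:
                b ^= g
                g >>= 1
            return b

        def hypercube_to_mesh(label, r, c):
            """Place an (r + c)-bit hypercube process label on a 2^r x 2^c mesh."""
            hi, lo = label >> c, label & ((1 << c) - 1)
            return gray_to_binary(hi), gray_to_binary(lo)    # (row, column)

        # Processes differing in a low-order bit share a row; those differing in a
        # high-order bit share a column, so each dimension exchange travels along
        # one mesh row or column and can be pipelined.
        for p in range(16):                                  # 4-dimensional hypercube on a 4 x 4 mesh
            print(format(p, "04b"), "->", hypercube_to_mesh(p, r=2, c=2))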

    Numerical wave propagation for the triangular $P1_{DG}$-$P2$ finite element pair

    Inertia-gravity mode and Rossby mode dispersion properties are examined for discretisations of the linearized rotating shallow-water equations using the $P1_{DG}$-$P2$ finite element pair on arbitrary triangulations in planar geometry. A discrete Helmholtz decomposition of the functions in the velocity space, based on potentials taken from the pressure space, is used to provide a complete description of the numerical wave propagation for the discretised equations. In the $f$-plane case, this decomposition is used to obtain decoupled equations for the geostrophic modes, the inertia-gravity modes, and the inertial oscillations. As has been noticed previously, the geostrophic modes are steady. The Helmholtz decomposition is used to show that the resulting inertia-gravity wave equation is third-order accurate in space. In general the $P1_{DG}$-$P2$ finite element pair is second-order accurate, so this leads to very accurate wave propagation. It is further shown that the only spurious modes supported by this discretisation are spurious inertial oscillations which have frequency $f$ and do not propagate. The Helmholtz decomposition also allows a simple derivation of the quasi-geostrophic limit of the discretised $P1_{DG}$-$P2$ equations in the $\beta$-plane case, resulting in a Rossby wave equation which is also third-order accurate. Comment: Revised version prior to final journal submission.
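
    For reference, the continuous setting behind the abstract, in standard notation (assumed here, not quoted from the paper): the linearized rotating shallow-water equations on the $f$-plane and the Helmholtz decomposition of the velocity into potentials $\phi$, $\psi$ drawn from the pressure space.

        % Linearized rotating shallow-water equations on the f-plane (standard form)
        \begin{align}
          \partial_t \mathbf{u} + f\,\mathbf{u}^{\perp} + g\,\nabla\eta &= 0,\\
          \partial_t \eta + H\,\nabla\!\cdot\mathbf{u} &= 0,
        \end{align}
        % Discrete Helmholtz decomposition of the P1_DG velocity, with potentials from the P2 space
        \begin{equation}
          \mathbf{u} = \nabla\phi + \nabla^{\perp}\psi, \qquad
          \mathbf{u}^{\perp} = (-u_2,\ u_1), \quad
          \nabla^{\perp}\psi = (-\partial_y\psi,\ \partial_x\psi).
        \end{equation}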

    Vectorization and Parallelization of the Adaptive Mesh Refinement N-body Code

    In this paper, we describe our vectorized and parallelized adaptive mesh refinement (AMR) N-body code with shared time steps and report its performance on a Fujitsu VPP5000 vector-parallel supercomputer. Our AMR N-body code places hierarchical meshes recursively where higher resolution is required, and the time step is the same for all particles. The parts which are the most difficult to vectorize are the loops that access the mesh data and the particle data. We vectorized such parts by changing the loop structure so that the innermost loop steps through the cells instead of the particles in each cell, in other words, by changing the loop order from depth-first to breadth-first. Mass assignment is also vectorizable using this loop-order exchange and splitting the loop into $2^{N_{dim}}$ loops, if the cloud-in-cell scheme is adopted. Here, $N_{dim}$ is the number of dimensions. These vectorization schemes, which eliminate the unvectorized loops, are also applicable to the parallelization of loops for shared-memory multiprocessors. We also parallelized our code for distributed-memory machines. The important part of the parallelization is the data decomposition. We sorted the hierarchical mesh data in the Morton order, or the recursive N-shaped order, level by level, then split the mesh data and allocated it to the processors. Particles are allocated to the processor to which the finest refined cells containing those particles are also assigned. Our timing analysis using $\Lambda$-dominated cold dark matter simulations shows that our parallel code speeds up almost ideally for up to 32 processors, the largest number of processors in our test. Comment: 21 pages, 16 figures, to be published in PASJ (Vol. 57, No. 5, Oct. 2005).
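
    A small Python sketch (illustrative, not the paper's vector-parallel Fortran; function and variable names are assumptions) of the Morton, or recursive N-shaped, ordering used for the data decomposition: interleaving the bits of a cell's integer coordinates yields a key whose sorted order keeps spatially nearby cells in contiguous chunks that can be handed to processors.

        def morton_key(ix, iy, iz, bits=10):
            """Interleave the low `bits` bits of the 3-D cell coordinates (z-order key)."""
            key = 0
            for b in range(bits):
                key |= ((ix >> b) & 1) << (3 * b)
                key |= ((iy >> b) & 1) << (3 * b + 1)
                key |= ((iz >> b) & 1) << (3 * b + 2)
            return key

        # Cells of one refinement level are sorted by key and cut into contiguous
        # chunks, one per processor; particles follow the finest cell containing them.
        cells = [(i, j, k) for i in range(4) for j in range(4) for k in range(4)]
        cells.sort(key=lambda c: morton_key(*c))
        n_proc = 4
        chunk = len(cells) // n_proc
        decomposition = [cells[p * chunk:(p + 1) * chunk] for p in range(n_proc)]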