
    A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters

    Sustaining a large fraction of single-GPU performance in parallel computations is considered the major challenge for GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail, and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios, multi-GPU setups make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented, using clusters equipped with varying node configurations. (20 pages, 12 figures)
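    Hiding halo communication behind bulk computation is what keeps the multi-GPU overhead small enough to sustain kernel performance. The sketch below shows that generic pattern for one time step; it is not the WaLBerla implementation, and the kernels, grid sizes, and exchange pattern are illustrative placeholders.

        // Generic pattern for one time step of a multi-GPU, MPI-parallel stencil/LBM
        // code: update the boundary layer first, exchange it while the interior
        // kernel runs, then merge the received halo. Illustrative sketch only.
        // Build, e.g.: nvcc -ccbin mpicxx overlap_sketch.cu -o overlap_sketch
        #include <mpi.h>
        #include <cuda_runtime.h>
        #include <vector>

        __global__ void update_boundary(float* f, int nx, int ny) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            if (x < nx) {
                f[x] *= 1.0f;                    // stand-in for collide/stream, bottom row
                f[(ny - 1) * nx + x] *= 1.0f;    // stand-in for collide/stream, top row
            }
        }

        __global__ void update_interior(float* f, int nx, int ny) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < nx * (ny - 2))
                f[nx + i] *= 1.0f;               // stand-in for collide/stream, interior
        }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank = 0, size = 1;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int nx = 256, ny = 256;
            float* f_dev = nullptr;
            cudaMalloc(&f_dev, nx * ny * sizeof(float));
            cudaMemset(f_dev, 0, nx * ny * sizeof(float));
            std::vector<float> send_halo(nx), recv_halo(nx);

            cudaStream_t s_bnd, s_int;
            cudaStreamCreate(&s_bnd);
            cudaStreamCreate(&s_int);
            const int up = (rank + 1) % size, down = (rank - 1 + size) % size;

            // 1. Boundary cells first, on their own stream.
            update_boundary<<<(nx + 255) / 256, 256, 0, s_bnd>>>(f_dev, nx, ny);
            // 2. Interior update launched concurrently; it dominates the runtime.
            update_interior<<<(nx * ny + 255) / 256, 256, 0, s_int>>>(f_dev, nx, ny);

            // 3. Exchange the boundary layer while the interior kernel is running.
            cudaMemcpyAsync(send_halo.data(), f_dev, nx * sizeof(float),
                            cudaMemcpyDeviceToHost, s_bnd);
            cudaStreamSynchronize(s_bnd);
            MPI_Request reqs[2];
            MPI_Isend(send_halo.data(), nx, MPI_FLOAT, up,   0, MPI_COMM_WORLD, &reqs[0]);
            MPI_Irecv(recv_halo.data(), nx, MPI_FLOAT, down, 0, MPI_COMM_WORLD, &reqs[1]);
            MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

            // 4. Push the received halo back and wait for the interior kernel.
            cudaMemcpyAsync(f_dev, recv_halo.data(), nx * sizeof(float),
                            cudaMemcpyHostToDevice, s_bnd);
            cudaDeviceSynchronize();

            cudaFree(f_dev);
            MPI_Finalize();
            return 0;
        }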

    Massively parallel lattice–Boltzmann codes on large GPU clusters

    This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on a single GPU and for good scaling behaviour on large numbers of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communication must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and by experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.

    A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters

    Heterogeneous clusters are a widely used class of supercomputers assembled from different types of computing devices, for instance CPUs and GPUs, providing a huge computational potential. Programming them in a scalable way that exploits the maximal performance introduces numerous challenges, such as optimizations for the different computing devices, dealing with multiple levels of parallelism, the application of different programming models, work distribution, and hiding communication behind computation. We utilize the lattice Boltzmann method for fluid flow as a representative scientific computing application and develop a holistic implementation for large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and techniques, ranging from optimizations for the particular computing devices to the orchestration of tens of thousands of CPU cores and thousands of GPUs. Eventually, we arrive at an implementation that uses all the available computational resources for the lattice Boltzmann method operators. Our approach shows excellent scalability, making it future-proof for heterogeneous clusters of upcoming architectures on the exaFLOPS scale. Parallel efficiencies of more than 90% are achieved, leading to 2,604.72 GLUPS utilizing 24,576 CPU cores and 2,048 GPUs of the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 · 10^9 lattice cells. This work was supported by the German Research Foundation (DFG) as part of the Transregional Collaborative Research Centre "Invasive Computing" (SFB/TR 89), and by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d68. We further thank the Max Planck Computing & Data Facility (MPCDF) and the Global Scientific Information and Computing Center (GSIC) for providing computational resources.
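    For scale, GLUPS counts giga lattice-cell updates per second, so the reported figures imply a sustained rate of roughly 380 full-domain time steps per second over the 6.8 · 10^9-cell lattice. The trivial host-side check below uses only the numbers quoted above:

        // Relating the reported GLUPS figure to a time-step rate (host-only check).
        #include <cstdio>

        int main() {
            const double glups = 2604.72;   // reported performance, giga lattice updates / s
            const double cells = 6.8e9;     // reported number of lattice cells
            const double updates_per_second = glups * 1e9;
            // Implied full-domain time steps per second: ~383
            std::printf("time steps per second ~ %.0f\n", updates_per_second / cells);
            return 0;
        }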

    Performance evaluation of a two-dimensional lattice Boltzmann solver using CUDA and PGAS UPC based parallelisation

    The Unified Parallel C (UPC) language from the Partitioned Global Address Space (PGAS) family unifies the advantages of shared and local memory spaces and offers a relatively straightforward code parallelisation on the Central Processing Unit (CPU). In contrast, the Compute Unified Device Architecture (CUDA) development kit gives a tool to make use of the Graphics Processing Unit (GPU). We provide a detailed comparison between these novel techniques through the parallelisation of a two-dimensional lattice Boltzmann method based fluid flow solver. Our comparison between the CUDA and UPC parallelisations takes into account the required conceptual effort, the performance gain, and the limitations of the approaches from the application-oriented developer's point of view. We demonstrated that UPC led to competitive efficiency with the local memory implementation. However, the performance of the shared memory code fell behind our expectations, and we concluded that the investigated UPC compilers could not treat the shared memory space efficiently. The CUDA implementation proved to be more complex than the UPC approach, mainly because of the intricate memory hierarchy of the graphics card, which is also what makes GPUs well suited to the parallelisation of the lattice Boltzmann method.
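    To give a feel for what the GPU has to do per lattice cell, the sketch below is a fused collide-and-stream step of a D2Q9 BGK lattice Boltzmann solver in CUDA. It is an illustrative kernel with an assumed structure-of-arrays layout, not the authors' implementation; the coalescing behaviour of exactly these gather and write accesses is what makes the GPU memory hierarchy the dominant concern.

        #include <cuda_runtime.h>

        // D2Q9 lattice constants (standard values).
        __constant__ int   cx[9] = { 0, 1, 0,-1, 0, 1,-1,-1, 1 };
        __constant__ int   cy[9] = { 0, 0, 1, 0,-1, 1, 1,-1,-1 };
        __constant__ float w[9]  = { 4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                                     1.f/36, 1.f/36, 1.f/36, 1.f/36 };

        // One fused collide-and-stream step (pull scheme, periodic domain).
        // Populations are stored structure-of-arrays: f[q * nx*ny + y*nx + x].
        __global__ void lbm_step(const float* f_src, float* f_dst,
                                 int nx, int ny, float omega) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= nx || y >= ny) return;

            float f[9], rho = 0.f, ux = 0.f, uy = 0.f;
            for (int q = 0; q < 9; ++q) {
                // Pull the population streaming into (x, y) from its upwind neighbour.
                int xs = (x - cx[q] + nx) % nx;
                int ys = (y - cy[q] + ny) % ny;
                f[q] = f_src[q * nx * ny + ys * nx + xs];
                rho += f[q];
                ux  += f[q] * cx[q];
                uy  += f[q] * cy[q];
            }
            ux /= rho; uy /= rho;

            for (int q = 0; q < 9; ++q) {
                // BGK relaxation towards the local equilibrium distribution.
                float cu  = 3.f * (cx[q] * ux + cy[q] * uy);
                float feq = w[q] * rho * (1.f + cu + 0.5f * cu * cu
                                          - 1.5f * (ux * ux + uy * uy));
                f_dst[q * nx * ny + y * nx + x] = f[q] + omega * (feq - f[q]);
            }
        }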

    An investigation of the efficient implementation of Cellular Automata on multi-core CPU and GPU hardware

    Published in the Journal of Parallel and Distributed Computing, Vol. 77 (2015), DOI: 10.1016/j.jpdc.2014.10.011. Cellular automata (CA) have proven to be excellent tools for the simulation of a wide variety of phenomena in the natural world. They are ideal candidates for acceleration with modern general-purpose graphics processing unit (GPU/GPGPU) hardware, which consists of large numbers of small, tightly coupled processors. In this study the potential for speeding up CA execution using multi-core CPUs and GPUs is investigated, and the scalability of doing so with respect to standard CA parameters such as lattice and neighbourhood sizes, number of states, and number of generations is determined. Additionally, the impact of 'activity' (the number of 'alive' cells) within a given CA simulation is investigated, both by varying the random initial distribution of 'alive' cells and via the use of novel state transition rules, where a change in the dynamics of these rules (i.e. the number of states) allows for the investigation of the variable complexity within. Supported by the Engineering and Physical Sciences Research Council (EPSRC).
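    The per-cell parallelism that makes CA a good fit for GPUs is easy to see in a minimal CUDA kernel. The sketch below uses the two-state Game-of-Life rule purely as a stand-in (the study covers a range of rule sets, state counts, and neighbourhood sizes) and tracks the generation's 'activity' with a device-side counter; all names and the data layout are illustrative.

        #include <cuda_runtime.h>

        // One generation of a two-state CA on a periodic grid (Game-of-Life rule as an
        // illustrative stand-in). 'activity' accumulates the number of alive cells.
        __global__ void ca_step(const unsigned char* src, unsigned char* dst,
                                int nx, int ny, unsigned int* activity) {
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= nx || y >= ny) return;

            // Count alive cells in the Moore neighbourhood (8 neighbours, periodic wrap).
            int alive = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    int xn = (x + dx + nx) % nx;
                    int yn = (y + dy + ny) % ny;
                    alive += src[yn * nx + xn];
                }

            unsigned char self = src[y * nx + x];
            unsigned char next = (alive == 3 || (self && alive == 2)) ? 1 : 0;
            dst[y * nx + x] = next;

            // Track total activity for this generation (one atomic per alive cell).
            if (next) atomicAdd(activity, 1u);
        }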

    High Performance Matrix-Free Method for Large-Scale Finite Element Analysis on Graphics Processing Units

    This thesis presents a high-performance computing (HPC) algorithm on graphics processing units (GPUs) for large-scale numerical simulations. In particular, the research focuses on the development of an efficient matrix-free conjugate gradient solver for the acceleration and scaling of steady-state heat transfer finite element analysis (FEA) on a three-dimensional uniform structured hexahedral mesh using a voxel-based technique. One of the greatest challenges in large-scale FEA is the availability of computer memory for solving the linear system of equations. In large-scale heat transfer simulations, where the assembled system matrix becomes very large, the FEA solver requires amounts of computational time and memory that very often exceed the limits of the available hardware. To overcome this problem, a matrix-free conjugate gradient (MFCG) method is designed and applied to the finite element computations, avoiding the global matrix assembly. The main difference between the MFCG and the classical conjugate gradient (CG) solver lies in the implementation of the matrix-vector product. The matrix-vector operation is found to be the most expensive step, consuming more than 80% of the total computation for the numerical solution, so a matrix-free matrix-vector (MFMV) approach saves both memory and computational time throughout the execution of the FEA. In summary, the MFMV algorithm consists of three nested loops: (a) a loop over the mesh elements of the domain, (b) a loop over the element nodal values to perform the element matrix-vector operations, and (c) the summation and scattering of the nodal values into their correct positions in the global index; a sketch of this operation on the GPU is given below. A performance analysis of a serial and a parallel GPU implementation shows that the MFCG solver outperforms the classical CG while consuming significantly less memory, allowing much larger simulations. The outcome of this study suggests that the MFCG can also speed up and scale the execution of large-scale finite element simulations.
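    The three nested loops map naturally onto a one-thread-per-element GPU kernel that gathers nodal values, applies the local element matrix, and scatter-adds the result into the global vector with atomics. The sketch below is a minimal illustration under assumed conventions (8-node hexahedra, precomputed per-element matrices in Ke, a connectivity array conn); it is not the thesis code.

        #include <cuda_runtime.h>

        // Matrix-free matrix-vector product y = A * x for a hexahedral FE mesh
        // (illustrative sketch). Each thread handles one element: it gathers the 8
        // nodal values, applies the 8x8 element matrix, and scatters the result back
        // with atomics, so the global matrix A is never assembled.
        __global__ void mfmv(const int*   __restrict__ conn,  // conn[e*8 + a] = global node id
                             const float* __restrict__ Ke,    // Ke[e*64 + a*8 + b] = element matrix
                             const float* __restrict__ x,     // input nodal vector
                             float*       __restrict__ y,     // output nodal vector (pre-zeroed)
                             int num_elems) {
            int e = blockIdx.x * blockDim.x + threadIdx.x;    // loop (a): one element per thread
            if (e >= num_elems) return;

            // loop (b): gather the element's nodal values from the global vector
            float xe[8];
            for (int a = 0; a < 8; ++a) xe[a] = x[conn[e * 8 + a]];

            // loop (b) continued: local 8x8 matrix-vector product,
            // then (c): scatter-add into the correct global positions
            for (int a = 0; a < 8; ++a) {
                float ye = 0.f;
                for (int b = 0; b < 8; ++b)
                    ye += Ke[e * 64 + a * 8 + b] * xe[b];
                atomicAdd(&y[conn[e * 8 + a]], ye);
            }
        }

    Inside a CG iteration, a kernel of this kind simply takes the place of the assembled sparse matrix-vector product, which is where the memory saving over the classical CG solver comes from.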