11 research outputs found

    A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters

    Sustaining a large fraction of single-GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail, and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios, multi-GPU setups make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented, using clusters equipped with varying node configurations.
    Comment: 20 pages, 12 figures
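
    The block-structured parallelization described above assigns patches of the domain to MPI processes and refreshes ghost (halo) layers between neighbouring patches after every time step. The sketch below renders that communication pattern in Python with mpi4py and NumPy rather than the solver's C++/CUDA code; all names and sizes are illustrative, and in the actual multi-GPU implementation the patch data resides in GPU memory and is staged through transfer buffers around the MPI calls.

    # One patch per MPI rank, 1D process arrangement, D2Q9 layout with a
    # one-cell ghost layer exchanged after every stream/collide step.
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    nx, ny, q = 64, 64, 9                    # local patch size, D2Q9 directions
    pdf = np.zeros((nx + 2, ny + 2, q))      # +2 for the ghost layers

    left  = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    def exchange_ghost_layers(f):
        # Send the outermost inner layer, receive into the opposite ghost layer.
        send = np.ascontiguousarray(f[-2])
        recv = np.zeros_like(send)
        comm.Sendrecv(send, dest=right, recvbuf=recv, source=left)
        f[0] = recv
        send = np.ascontiguousarray(f[1])
        recv = np.zeros_like(send)
        comm.Sendrecv(send, dest=left, recvbuf=recv, source=right)
        f[-1] = recv

    for step in range(100):
        # collide_and_stream(pdf)            # GPU kernel in the actual solver
        exchange_ghost_layers(pdf)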

    Exploring performance and power properties of modern multicore chips via simple machine models

    Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different hardware requirements, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.
    Comment: 23 pages, 10 figures. Typos corrected, DOI added
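
    The scaling and saturation prediction mentioned above can be condensed into a few lines. The sketch below is a simplified, generic rendering of an ECM-style prediction, not the article's calibrated model, and the cycle counts are placeholders: the single-core time per unit of work combines the in-core (overlapping) part with the non-overlapping data-transfer contributions, performance grows linearly with the core count, and it saturates once the memory-transfer share alone fills the available bandwidth.

    import math

    def ecm_prediction(t_core_ol, t_core_nol, t_cache, t_mem, clock_hz, work_per_cl, cores):
        """Predicted performance (updates/s) for 1..cores active cores."""
        # Single-core cycles per cache line of work: the larger of the in-core
        # part and the sum of the non-overlapping transfer contributions.
        t_single = max(t_core_ol, t_core_nol + sum(t_cache) + t_mem)
        p_single = work_per_cl * clock_hz / t_single      # one core
        p_roof   = work_per_cl * clock_hz / t_mem         # memory-bandwidth ceiling
        n_sat    = math.ceil(t_single / t_mem)            # cores needed to saturate
        return [min(n * p_single, p_roof) for n in range(1, cores + 1)], n_sat

    # Placeholder numbers for a bandwidth-limited streaming kernel.
    perf, n_sat = ecm_prediction(t_core_ol=12, t_core_nol=6, t_cache=[3, 5],
                                 t_mem=17, clock_hz=2.7e9, work_per_cl=8, cores=8)
    print(perf, "-> saturates at", n_sat, "cores")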

    Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations

    We present a simple, parallel, and distributed algorithm for setting up and partitioning a sparse representation of a regularly discretized simulation domain. This method scales to a large number of processes even for complex geometries and ensures load balance between the domains, reasonable communication interfaces, and good data locality within each domain. Applied to a list-based lattice Boltzmann flow solver, this scheme achieves similar or even higher flow solver performance than widely used graph-partitioning tools such as METIS and PT-SCOTCH.
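
    The sketch below illustrates the basic idea in Python; it is an assumed simplification, not the paper's algorithm: instead of building an explicit adjacency graph for METIS or PT-SCOTCH, the fluid cells of the regular grid are enumerated in a locality-preserving order and split into contiguous chunks of equal size, which balances the load of a list-based solver whose work is proportional to the number of fluid cells per process.

    import numpy as np

    def partition_fluid_cells(flag_field, num_parts):
        """flag_field: 3D bool array, True marks a fluid cell.
        Returns the fluid cell indices and a part id for each of them."""
        # Lexicographic ordering of a regular grid already preserves locality
        # reasonably well; a Morton/Hilbert ordering would improve it further.
        fluid_idx = np.flatnonzero(flag_field.ravel(order='C'))
        # Equal numbers of fluid cells per part -> balanced work for a
        # list-based lattice Boltzmann solver.
        part_of = (np.arange(fluid_idx.size) * num_parts) // fluid_idx.size
        return fluid_idx, part_of

    # Example: a sparse, porous-medium-like geometry split into 4 parts.
    rng = np.random.default_rng(0)
    flags = rng.random((32, 32, 32)) > 0.6
    cells, owner = partition_fluid_cells(flags, num_parts=4)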

    A New Approach to Reduce Memory Consumption in Lattice Boltzmann Method on GPU

    Several efforts have been made to address the computational-performance drawbacks of the lattice Boltzmann method (LBM). In this work, a new algorithm is introduced to reduce memory consumption. Most previous modifications of the LBM have not paid enough attention to retaining its simplicity, whereas this has been one of the main concerns in developing the present algorithm. The new algorithm also has a drawback: alongside the memory reduction, the additional reads from main memory cause some loss of computational efficiency. To overcome this difficulty, an optimization approach is introduced that recovers the efficiency of the original two-step, two-lattice LBM. This is accomplished by a trade-off between memory reduction and computational performance. While keeping a suitable computational efficiency, the memory reduction reaches about 33% for D2Q9 and 42% for D3Q19. In addition, the approach has been implemented on the graphics processing unit (GPU). Given the limited onboard memory of GPUs, the advantage of the new algorithm is even greater there (39% for D2Q9 and 45% for D3Q19). Moreover, because of the higher memory bandwidth of the GPU, the computational performance of the new algorithm on the GPU is better than on the CPU.
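
    To put the quoted percentages into perspective, the back-of-the-envelope calculation below translates them into bytes per cell for double precision, taking the classic two-step, two-lattice scheme (two full sets of distribution functions per cell) as the baseline; the exact data layout of the new algorithm is not given in the abstract, so the reduced figures follow purely from the stated reductions.

    BYTES_PER_VALUE = 8  # double precision

    def per_cell_footprint(q, reduction):
        """q: number of lattice directions; reduction: fraction of memory saved."""
        two_lattice = 2 * q * BYTES_PER_VALUE     # baseline: two full PDF sets per cell
        return two_lattice, two_lattice * (1.0 - reduction)

    for name, q, red in [("D2Q9 (CPU)", 9, 0.33), ("D3Q19 (CPU)", 19, 0.42),
                         ("D2Q9 (GPU)", 9, 0.39), ("D3Q19 (GPU)", 19, 0.45)]:
        base, new = per_cell_footprint(q, red)
        print(f"{name}: {base} B/cell -> {new:.0f} B/cell")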

    Density driven Natural Convection Heat Transfer in fully immersed Liquid-Cooled Data Centre Application

    Data centres are developing at a rapid pace with the continued increase in digital demands. Data centre cooling and energy efficiency are a growing topic of interest that requires new engineering solutions. To achieve both better cooling and higher efficiency, liquid-cooled computer systems are being considered as one of the best solutions. Totally liquid-cooled computers are not new, but the power densities required for supercomputers have driven a resurgence in liquid cooling, in particular in solutions that do not require air as a cooling medium. Recently the industry has developed an advanced fully immersed liquid-cooled data centre solution to fulfil this purpose. The core technology of the design is a liquid-cooled computer node (the first cooling stage), which relies on density-driven natural convection and has challenging engineering requirements. This thesis looks at density-driven natural convection from a different angle, simplifying the Navier-Stokes equations and the convection-diffusion equation to develop a Constant Thermal Gradient (CTG) model that solves the natural convection flow analytically. The CTG model yields algebraic solutions for the velocity and temperature profiles, and can therefore give the characteristic flow length (l*) and indicate the boundary layer thickness directly. The development and use of the CTG model is the academic contribution of this thesis, and it provides a clearer understanding of the natural convection mechanism. This thesis also uses CFD simulation (ANSYS CFX) and laboratory experiments to analyse the heat transfer performance of the liquid-cooled system. A group of CFD simulations of a cavity convection problem has been carried out to find the appropriate approximation factor for the CTG model, thereby completing the CTG model and making it ready for further analysis. A full-scale CFD simulation has also been carried out to analyse the first cooling stage of the system for a given condition, and a real computer system has been tested under the same condition. A three-step research workflow has then been developed for heat transfer analysis of a natural-convection-based liquid-cooled system: CTG model, CFD simulation, and experimental test. This thermal analysis workflow provides a knowledge base for further improvements in the cooling design of the system, and this is the engineering contribution of this thesis. To demonstrate the thermal advantages of the fully immersed liquid-cooled system, further intensive real-world tests have been carried out. One is a benchmark between an advanced back-door water-cooled system and a fully immersed liquid-cooled system, which demonstrates the thermal benefit of the fully liquid-cooled solution. The other is a series of real-world tests on a fully immersed liquid-cooled system aiming to achieve the ASHRAE W5 standard, which demonstrates the practicality of the liquid-cooled solution. The benchmark test in this thesis was published at the Semi-Therm conference.
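
    The abstract does not spell out the simplification behind the CTG model, but the standard starting point for such a derivation is the steady, incompressible Navier-Stokes equations under the Boussinesq approximation together with the convection-diffusion equation for temperature, reproduced here in LaTeX for orientation (u is velocity, p pressure, T temperature, beta the thermal expansion coefficient, alpha the thermal diffusivity):

    \nabla \cdot \mathbf{u} = 0, \qquad
    \rho_0 (\mathbf{u} \cdot \nabla)\mathbf{u}
        = -\nabla p + \mu \nabla^2 \mathbf{u}
          - \rho_0 \beta (T - T_0)\, \mathbf{g}, \qquad
    (\mathbf{u} \cdot \nabla) T = \alpha \nabla^2 T .

    The CTG model presumably reduces these to ordinary differential equations by imposing a constant thermal gradient, which is what makes the algebraic velocity and temperature profiles, the characteristic length l*, and the boundary layer thickness directly accessible.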