73 research outputs found
Rayleigh and Prandtl number scaling in the bulk of Rayleigh-Benard turbulence
The Rayleigh (Ra) and Prandtl (Pr) number scaling of the Nusselt number Nu,
the Reynolds number Re, the temperature fluctuations, and the kinetic and
thermal dissipation rates is studied for (numerical) homogeneous
Rayleigh-Benard turbulence, i.e., Rayleigh-Benard turbulence with periodic
boundary conditions in all directions and a volume forcing of the temperature
field by a mean gradient. This system serves as a model system for the bulk of
Rayleigh-Benard flow and therefore as a model for the so-called ``ultimate regime
of thermal convection''. With respect to the Ra dependence of Nu and Re we
confirm our earlier results \cite{loh03} which are consistent with the
Kraichnan theory \cite{kra62} and the Grossmann-Lohse (GL) theory
\cite{gro00,gro01,gro02,gro04}, which both predict $Nu \sim Ra^{1/2}$ and $Re \sim Ra^{1/2}$. However, the Pr dependence within these two theories is
different. Here we show that the numerical data are consistent with the GL
theory, $Nu \sim Pr^{1/2}$, $Re \sim Pr^{-1/2}$. For the thermal and kinetic
dissipation rates we find $\epsilon_\theta/(\kappa \Delta^{2}L^{-2}) \sim (Re Pr)^{0.87}$
and $\epsilon_u/(\nu^3 L^{-4}) \sim Re^{2.77}$, also both consistent
with the GL theory, whereas the temperature fluctuations do not depend on Ra
and Pr. Finally, the dynamics of the heat transport is studied and put into the
context of a recent theoretical finding by Doering et al. \cite{doe05}.
Comment: 8 pages, 9 figures
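As a rough consistency check on the dissipation-rate exponents quoted in this abstract, the following sketch assumes the ultimate-regime scalings $Nu \sim Ra^{1/2} Pr^{1/2}$ and $Re \sim Ra^{1/2} Pr^{-1/2}$ together with the exact dissipation relations for homogeneous Rayleigh-Benard flow in the large-Nu limit. These inputs are assumptions made for illustration and are not spelled out in the abstract itself.

```latex
% Assumed exact relations (homogeneous RB, large Nu):
%   \epsilon_\theta = \kappa \Delta^2 L^{-2} \, Nu ,
%   \epsilon_u      = \nu^3 L^{-4} \, Nu \, Ra \, Pr^{-2} .
% Assumed scalings: Nu \sim Ra^{1/2} Pr^{1/2}, \; Re \sim Ra^{1/2} Pr^{-1/2}.
\begin{align}
  \frac{\epsilon_\theta}{\kappa \Delta^{2} L^{-2}}
    &= Nu \sim Ra^{1/2} Pr^{1/2}
     = \left(Ra^{1/2} Pr^{-1/2}\right) Pr \sim Re\,Pr , \\
  \frac{\epsilon_u}{\nu^{3} L^{-4}}
    &= Nu\,Ra\,Pr^{-2} \sim Ra^{3/2} Pr^{-3/2}
     = \left(Ra^{1/2} Pr^{-1/2}\right)^{3} \sim Re^{3} .
\end{align}
```

Under these assumptions the idealized exponents are 1 and 3, which is the benchmark against which the fitted values 0.87 and 2.77 can be read.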
The Hypothesis of Superluminal Neutrinos: comparing OPERA with other Data
The OPERA Collaboration reported evidence for muonic neutrinos traveling
slightly faster than light in vacuum. While awaiting further checks from the
experimental community, here we aim at exploring some theoretical consequences
of the hypothesis that muonic neutrinos are superluminal, considering in
particular the tachyonic and the Coleman-Glashow cases. We show that a
tachyonic interpretation is not only hard to reconcile with OPERA data on
energy dependence, but also clashes with neutrino production from pions and
with neutrino oscillations. A Coleman-Glashow superluminal neutrino beam would
also have problems with pion decay kinematics for the OPERA setup; it could be
easily reconciled with SN1987a data, but then it would be very problematic to
account for neutrino oscillations.
Comment: v1: 10 pages, 2 figures; v2: 12 pages, 2 figures, improved discussion
of CG case as for pion decay and neutrino oscillations, added reference
Optimization of lattice Boltzmann simulations on heterogeneous computers
High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper, we consider exactly this problem for a class of applications based on lattice Boltzmann methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit the different parallel and vector options of the various accelerators efficiently, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the best achievable overall performance level. We test the performance of our codes and their scaling properties using, as testbeds, HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs, and AMD GPUs.
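The idea of predicting a balanced host/accelerator split can be illustrated with a deliberately simple model. The sketch below is not the model defined in the paper: it assumes each device processes lattice sites at a fixed, known rate, that both run concurrently, and that communication costs are negligible. The function names and parameters are hypothetical.

```python
def optimal_host_fraction(host_rate, acc_rate):
    """Fraction of lattice sites to give the host so that host and
    accelerator finish at the same time (rates in sites per second).

    Assumption: fixed throughput per device and no communication cost.
    """
    if host_rate <= 0 or acc_rate <= 0:
        raise ValueError("rates must be positive")
    return host_rate / (host_rate + acc_rate)


def step_time(n_sites, host_fraction, host_rate, acc_rate):
    """Wall-clock time of one step when both devices run concurrently:
    the slower of the two partitions determines the total time."""
    host_time = n_sites * host_fraction / host_rate
    acc_time = n_sites * (1.0 - host_fraction) / acc_rate
    return max(host_time, acc_time)


# Example with made-up rates: the accelerator is three times faster.
fraction = optimal_host_fraction(1.0, 3.0)
balanced = step_time(400, fraction, 1.0, 3.0)
offload_only = step_time(400, 0.0, 1.0, 3.0)
```

In this toy setting the balanced split is faster than offloading everything to the accelerator, which is the effect the abstract describes qualitatively.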
Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications
Knights Landing (KNL) is the codename for the latest generation of Intel
processors based on Intel Many Integrated Core (MIC) architecture. It relies on
massive thread and data parallelism, and fast on-chip memory. This processor
operates in standalone mode, booting an off-the-shelf Linux operating system.
The KNL peak performance is very high - approximately 3 Tflops in double
precision and 6 Tflops in single precision - but sustained performance depends
critically on how well all parallel features of the processor are exploited by
real-life applications. We assess the performance of this processor for Lattice
Boltzmann codes, widely used in computational fluid dynamics. In our OpenMP
code we consider several memory data layouts that meet the conflicting
computing requirements of distinct parts of the application, and sustain a
large fraction of peak performance. We make some performance comparisons with
other processors and accelerators, and also discuss the impact of the various
memory layouts on energy efficiency.
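To make the notion of a memory data layout concrete, the sketch below contrasts two common choices for storing lattice populations, array-of-structures (AoS) and structure-of-arrays (SoA). The abstract does not name the layouts it evaluates, so these two, and the function names, are illustrative assumptions only.

```python
def aos_index(site, pop, n_sites, n_pops):
    """Array of structures: all populations of one site are contiguous."""
    return site * n_pops + pop


def soa_index(site, pop, n_sites, n_pops):
    """Structure of arrays: one population of all sites is contiguous."""
    return pop * n_sites + site


def site_stride(index_fn, n_sites, n_pops):
    """Distance in memory between the same population on adjacent sites."""
    return index_fn(1, 0, n_sites, n_pops) - index_fn(0, 0, n_sites, n_pops)


# Tiny hypothetical lattice: 4 sites, 3 populations per site.
aos_stride = site_stride(aos_index, 4, 3)
soa_stride = site_stride(soa_index, 4, 3)
```

A unit stride across sites (SoA) tends to suit vectorized loops over sites, while AoS keeps the data of a single site close together; which one wins depends on the kernel, which is why conflicting requirements arise.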
Performance and portability of accelerated lattice Boltzmann applications with OpenACC
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only with specific programming languages, threatening maintainability, portability, and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives to mark regions of existing C, C++, or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper, we address precisely this issue, using as a test-bench a massively parallel lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
FFT for the APE Parallel Computer
We present a parallel FFT algorithm for SIMD systems following the `Transpose
Algorithm' approach. The method is based on the assignment of the data field
onto a 1-dimensional ring of systolic cells. The systolic array can be
universally mapped onto any parallel system. In particular for systems with
next-neighbour connectivity our method has the potential to improve the
efficiency of matrix transposition by use of hyper-systolic communication. We
have realized a scalable parallel FFT on the APE100/Quadrics massively parallel
computer, where our implementation is part of a 2-dimensional hydrodynamics
code for turbulence studies. A possible generalization to 4-dimensional FFT is
presented, having in mind QCD applications.
Comment: 17 pages, 13 figures, figures included
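The `Transpose Algorithm' approach mentioned in this abstract can be sketched serially as: transform along rows, transpose, transform along rows again, transpose back. The code below is a minimal illustration under stated assumptions, not the APE implementation: a naive O(n^2) DFT stands in for the 1-dimensional FFT, everything runs on one processor, and all function names are hypothetical. In a parallel setting the transposition is the step that requires communication.

```python
import cmath


def dft(values):
    """Naive 1-D discrete Fourier transform (stand-in for a 1-D FFT)."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def fft2_transpose(matrix):
    """2-D transform built only from row transforms and transpositions."""
    rows_done = [dft(row) for row in matrix]          # transform along rows
    columns_as_rows = transpose(rows_done)            # columns become rows
    columns_done = [dft(row) for row in columns_as_rows]
    return transpose(columns_done)                    # restore orientation


def dft2_direct(matrix):
    """Direct 2-D DFT from the definition, used only as a reference."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return [
        [
            sum(
                matrix[r][c]
                * cmath.exp(-2j * cmath.pi * (r * k / n_rows + c * l / n_cols))
                for r in range(n_rows)
                for c in range(n_cols)
            )
            for l in range(n_cols)
        ]
        for k in range(n_rows)
    ]


sample = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
via_transpose = fft2_transpose(sample)
reference = dft2_direct(sample)
```

Because each pass only touches data along rows, every processing element can work on locally stored rows, and all data movement is concentrated in the transpose.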
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
The present panorama of HPC architectures is extremely heterogeneous, ranging
from traditional multi-core CPU processors, supporting a wide class of
applications but delivering moderate computing performance, to many-core GPUs,
exploiting aggressive data-parallelism and delivering higher performance for
streaming computing applications. In this scenario, code portability (and
performance portability) becomes necessary for easy maintainability of
applications; this is very relevant in scientific computing where code changes
are very frequent, making it tedious and prone to error to keep different code
versions aligned. In this work we present the design and optimization of a
state-of-the-art production-level LQCD Monte Carlo application, using the
directive-based OpenACC programming model. OpenACC abstracts parallel
programming to a descriptive level, relieving programmers from specifying how
codes should be mapped onto the target architecture. We describe the
implementation of a code fully written in OpenACC, and show that we are able to
target several different architectures, including state-of-the-art traditional
CPUs and GPUs, with the same code. We also measure performance, evaluating the
computing efficiency of our OpenACC code on several architectures, comparing
with GPU-specific implementations and showing that a good level of
performance-portability can be reached.
Comment: 26 pages, 2 png figures, preprint of an article submitted for
consideration in International Journal of Modern Physics
Massively parallel lattice–Boltzmann codes on large GPU clusters
This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
Portable multi-node LQCD Monte Carlo simulations using OpenACC
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code
for staggered fermions, purposely designed to be portable across different
computer architectures, including GPUs and commodity CPUs. Portability is
achieved using the OpenACC parallel programming model, used to develop a code
that can be compiled for several processor architectures. The paper focuses on
parallelization on multiple computing nodes using OpenACC to manage parallelism
within the node, and OpenMPI to manage parallelism among the nodes. We first
discuss the available strategies to be adopted to maximize performance; we
then describe selected relevant details of the code, and finally measure the
level of performance and scaling-performance that we are able to achieve. The
work focuses mainly on GPUs, which offer a very high level of
performance for this application, but also compares with results measured on
other processors.
Comment: 22 pages, 8 png figures