Search CORE

3 research outputs found

Direct N-Body problem optimisation using the AVX-512 instruction set

Author: Dempsey Jim
Pedregosa-Gutierrez Jofre
Publication date: 21/06/2021

The integration of the equations of motion of N interacting particles represents a classical problem in many branches of physics and chemistry. The direct N-body problem is at the heart of simulations studying Coulomb crystals. We present a hand-optimized code for the latest AVX-512 instruction set that achieves a single-core speed-up of $\approx 340\%$ with respect to the version optimized by the compiler. The increased performance is due to an optimization of the organization of the memory access in the inner loop of the Coulomb interaction and, especially, to the use of an intrinsic function to compute $1/\sqrt{x}$ faster. Our parallelization, which is implemented in OpenMP, achieves excellent scalability with the number of cores. In total, we achieve $\approx 500$ GFLOPS using just a standard workstation with one Intel Skylake CPU (10 cores). This represents $\approx 75\%$ of the theoretical maximum number of double-precision FLOPS corresponding to fused multiply-add (FMA) operations.
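
To make the role of the inner loop and of the reciprocal square root concrete, here is a minimal scalar sketch of a direct O(N^2) Coulomb force sum in Python. It is not the authors' AVX-512 code: the function name `coulomb_forces`, the prefactor `k` and the softening term `eps` are illustrative assumptions, and the sketch only shows where the $1/\sqrt{x}$ evaluation sits in each pair interaction.

```python
import math


def coulomb_forces(pos, charges, k=1.0, eps=0.0):
    """Direct pairwise Coulomb forces, O(N^2) in the number of particles.

    pos     : list of (x, y, z) tuples
    charges : list of floats, one per particle
    k, eps  : illustrative prefactor and softening term (assumptions,
              not taken from the abstract)
    """
    n = len(pos)
    forces = []
    for i in range(n):
        xi, yi, zi = pos[i]
        fx = fy = fz = 0.0
        # Inner loop over all other particles: this is the hot loop.
        for j in range(n):
            if i == j:
                continue
            dx = xi - pos[j][0]
            dy = yi - pos[j][1]
            dz = zi - pos[j][2]
            r2 = dx * dx + dy * dy + dz * dz + eps
            inv_r = 1.0 / math.sqrt(r2)  # the 1/sqrt(x) evaluation
            # k * q_i * q_j / r^3, built from the reciprocal square root
            s = k * charges[i] * charges[j] * inv_r * inv_r * inv_r
            fx += s * dx
            fy += s * dy
            fz += s * dz
        forces.append((fx, fy, fz))
    return forces
```

For two unit charges separated by a distance of 2 along x, the sketch gives equal and opposite forces of magnitude k/4, as expected from Coulomb's law. In a vectorized version, each pair costs one reciprocal square root plus a handful of multiply-adds, which is why a faster $1/\sqrt{x}$ can dominate the gain.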

arXiv.org e-Print Archive

HAL AMU

HAL Descartes

Hal-Diderot

A scaling-less Newton-Raphson pipelined implementation for a fixed-point inverse square root operator

Author: Andriulli Francesco
Arzel Matthieu
Lahuec Cyril
Libessart Erwan
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 25/06/2017

The inverse square root is a common operation in digital signal processing architectures, in particular when matrix inversions are required. The Newton-Raphson algorithm is usually used, in either floating-point or fixed-point formats. With the former format, the well-known fast inverse square root computation is based on a 32-bit integer constant, which is made possible by the standardized format of the mantissa. For the fixed-point format, there are many possibilities, which usually force a design that scales the input in order to respect a pre-determined working range. Having the input in a known range makes it possible to compute a first approximation with coefficients stored in memory. In this paper, a novel generic architecture which does not require scaling is proposed. This design is fully pipelined, ROM-less and can be used directly in any architecture. The implementation is optimized to reach the maximum clock frequency offered by the DSP cells of Xilinx FPGAs. This frequency is higher than the one available when using memory blocks.
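
The floating-point technique the abstract refers to can be sketched as follows. This is only the well-known floating-point trick, not the scaling-less fixed-point architecture the paper proposes. The constant `0x5F3759DF` is the commonly cited value for 32-bit floats and is an assumption here, since the abstract does not name it; the function name `fast_inv_sqrt` and the `iterations` parameter are likewise illustrative.

```python
import struct


def fast_inv_sqrt(x, iterations=1):
    """Approximate 1/sqrt(x) for a positive, normal float32 value x.

    Step 1: reinterpret the float32 bit pattern as an unsigned 32-bit
            integer and derive an initial guess from an integer constant.
    Step 2: refine the guess with Newton-Raphson iterations.
    """
    # Reinterpret the float32 bits of x as a uint32.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Commonly cited magic constant (assumed, not given in the abstract).
    bits = 0x5F3759DF - (bits >> 1)
    # Reinterpret the integer back as a float32: the first approximation.
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Newton-Raphson step for f(y) = 1/y^2 - x: y <- y * (1.5 - 0.5*x*y*y)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Each Newton-Raphson step roughly squares the relative error, so one step brings the result to within a fraction of a percent and a second step to within a few parts per million. The initial guess works without a lookup table because the exponent and mantissa fields sit at fixed bit positions, which is the property a fixed-point format lacks.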

Crossref

HAL-Université de Bretagne Occidentale

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)