Direct N-Body problem optimisation using the AVX-512 instruction set
The integration of the equations of motion of N interacting particles is a
classical problem in many branches of physics and chemistry. The direct N-body
problem is at the heart of simulations of Coulomb crystals. We present a
hand-optimized code for the latest AVX-512 instruction set that achieves a
single-core speedup of […] with respect to the version optimized by the
compiler. The performance increase is due to an optimized organization of the
memory accesses in the inner loop of the Coulomb computation and, especially,
to the use of an intrinsic function to compute the […] faster. Our
parallelization, implemented in OpenMP, scales excellently with the number of
cores. In total, we achieve […] using just a standard workstation with one
Intel Skylake CPU (10 cores). This represents […] of the theoretical maximum
number of double-precision FLOPS corresponding to fused multiply-add (FMA)
operations.
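For reference, the kind of kernel this abstract describes can be sketched in a few lines of pure Python. This is only an illustration of a direct O(N^2) Coulomb force sum, not the authors' AVX-512 code. The function name coulomb_forces, the constant k, and the choice of a reciprocal square root as the expensive step are assumptions made here. Coordinates are kept in separate arrays (structure-of-arrays), a layout commonly used so that the inner loop reads memory contiguously.

```python
import math


def coulomb_forces(x, y, z, q, k=1.0):
    """Direct O(N^2) Coulomb forces (hypothetical helper, illustration only).

    x, y, z, q are equal-length sequences: one array per coordinate
    (structure-of-arrays) plus the charges. Returns (fx, fy, fz).
    """
    n = len(x)
    fx = [0.0] * n
    fy = [0.0] * n
    fz = [0.0] * n
    for i in range(n):
        xi, yi, zi = x[i], y[i], z[i]
        ax = ay = az = 0.0
        # Inner loop: the part a SIMD implementation would vectorize over j.
        for j in range(n):
            if j == i:
                continue
            dx = xi - x[j]
            dy = yi - y[j]
            dz = zi - z[j]
            r2 = dx * dx + dy * dy + dz * dz
            # Reciprocal square root: the costly operation in this loop.
            inv_r = 1.0 / math.sqrt(r2)
            s = q[j] * inv_r * inv_r * inv_r
            ax += s * dx
            ay += s * dy
            az += s * dz
        # Force on particle i: k * q_i * sum_j q_j * (r_i - r_j) / |r_ij|^3
        fx[i] = k * q[i] * ax
        fy[i] = k * q[i] * ay
        fz[i] = k * q[i] * az
    return fx, fy, fz
```

Two like charges repel along the line joining them, and the forces over all particles sum to zero, which gives a quick sanity check for any optimized variant.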
A scaling-less Newton-Raphson pipelined implementation for a fixed-point inverse square root operator
The inverse square root is a common operation in digital signal processing architectures, in particular when matrix inversions are required. The Newton-Raphson algorithm is usually used, in either floating-point or fixed-point format. In floating point, the well-known fast inverse square root computation relies on a 32-bit integer constant, which is made possible by the standardized format of the mantissa. In fixed point there are many possibilities, and they usually force a design that scales the input to fit a predetermined working range. Having the input in a known range makes it possible to compute a first approximation from coefficients stored in memory. In this paper, a novel generic architecture that does not require scaling is proposed. The design is fully pipelined, ROM-less, and can be used directly in any architecture. The implementation is optimized to reach the maximum clock frequency offered by the DSP cells of Xilinx FPGAs. This frequency is higher than the one achievable when memory blocks are used.
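The floating-point technique mentioned in this abstract can be sketched as follows. This is a software illustration of the widely known fast inverse square root, not the fixed-point FPGA architecture the paper proposes. The function name fast_inv_sqrt and the iterations parameter are assumptions made here, and the constant 0x5F3759DF is the commonly cited value for 32-bit floats. The input is assumed to be a positive, normal single-precision value.

```python
import struct


def fast_inv_sqrt(x, iterations=1):
    """Approximate 1/sqrt(x) for positive x (hypothetical helper).

    The initial guess comes from reinterpreting the float32 bit pattern
    as an integer; Newton-Raphson steps then refine it.
    """
    # Reinterpret the float32 bits of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Initial guess: magic constant minus half the bit pattern.
    bits = (0x5F3759DF - (bits >> 1)) & 0xFFFFFFFF
    y = struct.unpack("<f", struct.pack("<I", bits))[0]
    # Newton-Raphson on f(y) = 1/y^2 - x:  y <- y * (1.5 - 0.5 * x * y^2)
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Each Newton-Raphson step roughly doubles the number of correct bits, so one step is often enough for low-precision work and a second step brings the relative error down by several more orders of magnitude.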