3 research outputs found

    Direct N-Body problem optimisation using the AVX-512 instruction set

    The integration of the equations of motion of N interacting particles represents a classical problem in many branches of physics and chemistry. The direct N-body problem is at the heart of simulations studying Coulomb crystals. We present a hand-optimized code for the latest AVX-512 instruction set that achieves a single-core speed-up of $\approx 340\%$ with respect to the version optimized by the compiler. The increased performance is due to an optimized organization of the memory accesses in the Coulomb inner loop and, especially, to the use of an intrinsic function to compute $1/\sqrt{x}$ faster. Our parallelization, which is implemented in OpenMP, achieves excellent scalability with the number of cores. In total, we achieve $\approx 500$ GFLOPS using just a standard workstation with one Intel Skylake CPU (10 cores). This represents $\approx 75\%$ of the theoretical maximum number of double-precision FLOPS corresponding to fused multiply-add (FMA) operations.
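
To make the role of $1/\sqrt{x}$ concrete, here is a minimal scalar sketch of a direct-sum Coulomb force loop. It is not the authors' AVX-512 code. The function name `coulomb_forces`, the data layout, and the choice of units (Coulomb constant set to 1) are assumptions made for illustration. The reciprocal square root sits in the innermost loop, which is the operation the abstract says is accelerated with an intrinsic (AVX-512 offers approximate reciprocal-square-root intrinsics such as `_mm512_rsqrt14_pd`; the abstract does not say which one is used).

```python
import math


def coulomb_forces(positions, charges):
    """Direct O(N^2) Coulomb forces (illustrative units: Coulomb constant = 1).

    positions: list of (x, y, z) tuples; charges: list of floats.
    Returns a list of [fx, fy, fz], one per particle.
    """
    n = len(positions)
    forces = []
    for i in range(n):
        xi, yi, zi = positions[i]
        fx = fy = fz = 0.0
        for j in range(n):
            if j == i:
                continue  # skip self-interaction
            dx = xi - positions[j][0]
            dy = yi - positions[j][1]
            dz = zi - positions[j][2]
            r2 = dx * dx + dy * dy + dz * dz
            # Reciprocal square root: the hot operation of the inner loop.
            inv_r = 1.0 / math.sqrt(r2)
            # q_i * q_j / r^3, built from inv_r with multiplications only.
            s = charges[i] * charges[j] * inv_r * inv_r * inv_r
            fx += s * dx
            fy += s * dy
            fz += s * dz
        forces.append([fx, fy, fz])
    return forces
```

For example, `coulomb_forces([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 1.0])` gives equal and opposite forces of magnitude 0.25 along x, pushing the two like charges apart.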

    A scaling-less Newton-Raphson pipelined implementation for a fixed-point inverse square root operator

    The inverse square root is a common operation in digital signal processing architectures, in particular when matrix inversions are required. The Newton-Raphson algorithm is usually used, in either floating-point or fixed-point formats. With the former format, the well-known fast inverse square root computation is based on a 32-bit integer constant, which is made possible by the standardized format of the mantissa. For the fixed-point format, there are many possibilities, which usually force a design that scales the input in order to respect a pre-determined working range. Having the input in a known range makes it possible to compute a first approximation with coefficients stored in memory. In this paper, a novel generic architecture which does not require scaling is proposed. This design is fully pipelined, ROM-less and can be directly used in any architecture. The implementation is optimized to reach the maximum clock frequency offered by the DSP cells of Xilinx FPGAs. This frequency is higher than the one achievable when using memory blocks.
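
As an illustration of the floating-point approach mentioned above, the following is a minimal sketch of the well-known fast inverse square root: an initial guess obtained from a 32-bit integer constant, refined by Newton-Raphson steps. It is not the fixed-point architecture proposed in the paper. The function name `fast_inv_sqrt` and the `iterations` parameter are assumptions for illustration; `0x5F3759DF` is the widely cited constant for IEEE 754 single precision.

```python
import struct

MAGIC = 0x5F3759DF  # widely cited constant for IEEE 754 single precision


def fast_inv_sqrt(x, iterations=1):
    """Approximate 1/sqrt(x) for x > 0 using the bit-level trick plus Newton-Raphson."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Halving the bits roughly halves the exponent; the constant corrects the bias.
    i = (MAGIC - (i >> 1)) & 0xFFFFFFFF
    # Reinterpret the integer back as a float32: this is the initial guess.
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # Newton-Raphson on f(y) = 1/y^2 - x gives y <- y * (1.5 - 0.5 * x * y * y).
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y
```

Each Newton-Raphson step roughly squares the relative error, so one step brings the result to within a fraction of a percent of the exact value and a second step is accurate to a few parts per million.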