10 research outputs found

    Approaches for the Parallelization of Software Implementation of Integer Multiplication

    This paper considers several approaches to increasing the performance of a software implementation of the integer multiplication algorithm on 32-bit and 64-bit platforms through parallelization. The main idea is the delayed carry mechanism proposed by the authors earlier [11]. The delayed carry removes the dependency between the loop iterations that accumulate the sums of products, so those iterations can be executed in parallel in separate threads. Once the accumulation threads complete, the final result must be corrected by assimilating the carries. The first approach optimizes the parallelization for two execution threads; the second approach extends the first and targets three or more execution threads. The proposed approaches increase the total computational complexity of the algorithm relative to a single execution thread, but decrease the total execution time on a multi-core CPU.
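    As a rough illustration of the idea (a sketch under assumptions of my own, not the authors' implementation), the C fragment below accumulates every result column of a schoolbook multiplication independently, splits the column loop across two OpenMP threads, and assimilates the carries in one sequential pass at the end. The limb count NW and the function name are invented for the example.

        #include <stdint.h>
        /* Compile with -fopenmp; without it the pragma is simply ignored. */

        #define NW 8                      /* 32-bit limbs per operand (illustrative size) */

        /* Hypothetical sketch of delayed-carry schoolbook multiplication:
         * each result column is accumulated with no carry propagation, so the
         * column loop has no dependency between iterations and can be split
         * across two threads; carries are assimilated once at the end. */
        void mul32_delayed_carry_par(const uint32_t a[NW], const uint32_t b[NW],
                                     uint32_t r[2 * NW])
        {
            uint64_t lo_sum[2 * NW] = {0};    /* sums of low  halves of products */
            uint64_t hi_sum[2 * NW] = {0};    /* sums of high halves of products */

            #pragma omp parallel for num_threads(2) schedule(static)
            for (int k = 0; k <= 2 * NW - 2; k++) {       /* one column per iteration */
                uint64_t lo = 0, hi = 0;
                int i_min = k < NW ? 0 : k - NW + 1;
                int i_max = k < NW ? k : NW - 1;
                for (int i = i_min; i <= i_max; i++) {
                    uint64_t p = (uint64_t)a[i] * b[k - i];
                    lo += (uint32_t)p;        /* no carry handling inside the loop */
                    hi += p >> 32;
                }
                lo_sum[k] = lo;
                hi_sum[k] = hi;
            }

            /* Correction of the final result: assimilation of the delayed carries. */
            uint64_t carry = 0;
            for (int k = 0; k < 2 * NW; k++) {
                uint64_t t = lo_sum[k] + (k ? hi_sum[k - 1] : 0) + carry;
                r[k]  = (uint32_t)t;
                carry = t >> 32;
            }
        }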

    Approaches to Increasing the Performance of a Software Implementation of the Multiplication Operation in the Field of Integers

    The authors propose an approach to increasing the performance of a software implementation of the multiplication algorithm in a number field for 32-bit and 64-bit platforms. The approach consists in delaying the accounting of the carry out of the most significant bit while accumulating the sum, which removes the need to account for that carry at every iteration of the sum accumulation loop. The delayed carry reduces the total number of addition operations and makes it possible to apply existing parallelization technologies effectively.

    Approaches for the performance increasing of software implementation of integer multiplication in prime fields

    The authors propose an approach to increasing the performance of a software implementation of the finite field multiplication algorithm for 32-bit and 64-bit platforms. The approach is based on a delayed carry mechanism for the most significant bit during sum accumulation, which avoids having to account for the carry out of the most significant bit at each iteration of the sum accumulation loop. The delayed carry mechanism reduces the total number of additions and makes it possible to apply modern parallelization technologies.
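    For contrast, the fragment below sketches the conventional row-by-row multiply-accumulate step in which the carry must be propagated at every iteration of the accumulation loop; this per-iteration carry handling is exactly the dependency the delayed carry mechanism removes. The sketch and its names are my own illustration, not code from the paper.

        #include <stdint.h>

        #define NW 8                      /* 32-bit limbs per operand (illustrative size) */

        /* Conventional approach: r[0..NW] += a[0..NW-1] * b_digit, with the carry
         * out of the most significant bit handled on every single iteration.
         * (Assumes r[NW] can absorb the final carry, as in schoolbook
         * multiplication, where it is still zero at this point.) */
        static void addmul_row(uint32_t r[NW + 1], const uint32_t a[NW], uint32_t b_digit)
        {
            uint32_t carry = 0;
            for (int i = 0; i < NW; i++) {
                uint64_t t = (uint64_t)a[i] * b_digit + r[i] + carry;
                r[i]  = (uint32_t)t;          /* low 32 bits stay in place          */
                carry = (uint32_t)(t >> 32);  /* carry must be ready for the next i */
            }
            r[NW] += carry;                   /* carry out of the row */
        }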

    Performance Increasing Approaches For Binary Field Inversion

    The authors propose several approaches for increasing the performance of the multiplicative inversion algorithm in binary fields based on the Extended Euclidean Algorithm (EEA). The first approach exploits a property of the EEA: the invariant polynomial u either remains unchanged or is swapped with the invariant polynomial v, which makes it possible to avoid recomputing the degree of the polynomial v. The second approach is based on searching for the next significant index when computing a polynomial's degree: since the degree of the invariant polynomial u decreases by at least 1, the current value can be reused as the starting point of the next degree computation.
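    A minimal sketch of how the second optimization might look is given below (my own illustration, assuming the GF(2^m) polynomial is stored as an array of 64-bit words and using the GCC/Clang __builtin_clzll builtin; the names poly_degree and hint are invented for the example).

        #include <stdint.h>

        /* Returns the degree of the polynomial stored in w[0..nwords-1]
         * (w[0] = least significant word), or -1 for the zero polynomial.
         * *hint is the word index where the leading bit was found last time;
         * because the degree of u in the EEA only decreases, the scan can
         * start from *hint instead of from the top word. */
        static int poly_degree(const uint64_t *w, int nwords, int *hint)
        {
            int i = (*hint < nwords) ? *hint : nwords - 1;
            while (i >= 0 && w[i] == 0)
                i--;
            if (i < 0)
                return -1;                          /* zero polynomial */
            *hint = i;                              /* remember for the next call */
            return 64 * i + (63 - __builtin_clzll(w[i]));
        }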

    Techniques for Performance Improvement of Integer Multiplication in Cryptographic Applications

    The performance of arithmetic operations in number fields is actively researched by many scientists, as evidenced by a significant number of publications in this area. In this work, we offer techniques to increase the performance of a software implementation of the finite field multiplication algorithm for both 32-bit and 64-bit platforms. The developed technique, called the “delayed carry mechanism,” removes the need to account for the most significant bit carry at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and allows modern parallelization technologies to be applied effectively.
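    A sketch of the same mechanism on a 64-bit platform might look as follows (an illustrative example under my own assumptions, not the paper's code; it relies on the GCC/Clang unsigned __int128 extension as a double-width accumulator, and the limb count NW is invented).

        #include <stdint.h>

        #define NW 4                      /* 64-bit limbs per operand (illustrative size) */

        /* Delayed-carry schoolbook multiplication: column sums are accumulated
         * without any carry handling, and all carries are assimilated in a
         * single pass at the end. */
        void mul64_delayed_carry(const uint64_t a[NW], const uint64_t b[NW],
                                 uint64_t r[2 * NW])
        {
            unsigned __int128 lo_sum[2 * NW] = {0};   /* sums of low  product halves */
            unsigned __int128 hi_sum[2 * NW] = {0};   /* sums of high product halves */

            for (int i = 0; i < NW; i++) {
                for (int j = 0; j < NW; j++) {
                    unsigned __int128 p = (unsigned __int128)a[i] * b[j];
                    lo_sum[i + j] += (uint64_t)p;           /* low 64 bits  */
                    hi_sum[i + j] += (uint64_t)(p >> 64);   /* high 64 bits */
                }
            }

            /* The only place where carries are taken into account. */
            unsigned __int128 carry = 0;
            for (int k = 0; k < 2 * NW; k++) {
                unsigned __int128 t = lo_sum[k] + (k ? hi_sum[k - 1] : 0) + carry;
                r[k]  = (uint64_t)t;
                carry = t >> 64;
            }
        }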

    Vectorizing and distributing number-theoretic transform to count Goldbach partitions on Arm-based supercomputers

    In this article, we explore the usage of scalable vector extension (SVE) to vectorize number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can be efficiently implemented with SVE instructions. The vectorization of NTT loops and kernels involving 64-bit modular operations was not possible in previous Arm-based single instruction multiple data architectures since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on the A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on HPE Apollo 70, Cray XC50, HPE Apollo 80, and HPE Cray EX systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilized to count the number of Goldbach partitions of all even numbers to large limits. We present some preliminary results concerning this problem, in particular a histogram of the number of Goldbach partitions of the even numbers up to 2^40.
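    For readers unfamiliar with the kernels involved, the fragment below is a plain scalar reference (a sketch of my own, not the article's SVE code) for the kind of 64-bit modular butterfly that the article vectorizes; it uses the GCC/Clang unsigned __int128 extension for the 128-bit intermediate product, whereas the SVE implementation relies on specialized instructions and reduction techniques.

        #include <stdint.h>

        /* Modular helpers for a prime p < 2^63, operands already reduced mod p. */
        static inline uint64_t add_mod(uint64_t a, uint64_t b, uint64_t p)
        {
            uint64_t s = a + b;               /* cannot overflow for p < 2^63 */
            return s >= p ? s - p : s;
        }

        static inline uint64_t sub_mod(uint64_t a, uint64_t b, uint64_t p)
        {
            return a >= b ? a - b : a + p - b;
        }

        static inline uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t p)
        {
            return (uint64_t)(((unsigned __int128)a * b) % p);
        }

        /* Radix-2 butterflies of one NTT block: x[i] is combined with
         * x[i + half] using the twiddles w^0, w^1, ..., w^(half-1). */
        static void ntt_block(uint64_t *x, int half, uint64_t w, uint64_t p)
        {
            uint64_t wj = 1;
            for (int i = 0; i < half; i++) {
                uint64_t u = x[i];
                uint64_t v = mul_mod(x[i + half], wj, p);
                x[i]        = add_mod(u, v, p);
                x[i + half] = sub_mod(u, v, p);
                wj = mul_mod(wj, w, p);
            }
        }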

    Modular SIMD arithmetic in Mathemagix

    Modular integer arithmetic occurs in many algorithms for computer algebra, cryptography, and error correcting codes. Although recent microprocessors typically offer a wide range of highly optimized arithmetic functions, modular integer operations still require dedicated implementations. In this article, we survey existing algorithms for modular integer arithmetic, and present detailed vectorized counterparts. We also present several applications, such as fast modular Fourier transforms and multiplication of integer polynomials and matrices. The vectorized algorithms have been implemented in C++ inside the free computer algebra and analysis system Mathemagix. The performance of our implementation is illustrated by various benchmarks
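    To give a flavor of what vectorized modular arithmetic looks like (a minimal sketch of my own, not Mathemagix code), the fragment below adds two arrays of residues modulo a prime p < 2^31 four lanes at a time with SSE2 intrinsics, using the usual trick of subtracting p and adding it back wherever the result went negative, so no per-lane branch is needed.

        #include <stdint.h>
        #include <stddef.h>
        #include <emmintrin.h>            /* SSE2 intrinsics */

        /* r[i] = (a[i] + b[i]) mod p for 0 <= a[i], b[i] < p and p < 2^31. */
        void add_mod_sse2(uint32_t *r, const uint32_t *a, const uint32_t *b,
                          size_t n, uint32_t p)
        {
            const __m128i vp = _mm_set1_epi32((int32_t)p);
            size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
                __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
                __m128i t  = _mm_sub_epi32(_mm_add_epi32(va, vb), vp);
                __m128i m  = _mm_srai_epi32(t, 31);           /* all-ones where t < 0 */
                t = _mm_add_epi32(t, _mm_and_si128(m, vp));   /* add p back if needed */
                _mm_storeu_si128((__m128i *)(r + i), t);
            }
            for (; i < n; i++) {                              /* scalar tail */
                uint32_t s = a[i] + b[i];
                r[i] = s >= p ? s - p : s;
            }
        }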

    Comparison of Modular Arithmetic Algorithms on GPUs

    We present below our first implementation results for a modular arithmetic library on GPUs for cryptography. Our library, written in C++ for CUDA, provides modular arithmetic, finite field arithmetic, and some ECC support. Several algorithms and memory coding styles have been compared: local, shared, and register. For moderate sizes, we report up to a 2.6x speedup compared to a state-of-the-art library.