10 research outputs found
Approaches for the Parallelization of Software Implementation of Integer Multiplication
This paper considers several approaches to improving the performance of a software implementation of the integer multiplication algorithm on 32-bit and 64-bit platforms via parallelization. The main idea is the delayed carry mechanism that the authors proposed earlier [11]. Delaying the carry removes the dependency between loop iterations when accumulating sums of products, which allows the loop iterations to be executed in parallel in separate threads. Once the sum accumulation threads complete, the final result must be corrected by assimilating the carries. The first approach optimizes the parallelization for two execution threads; the second is an evolution of the first and targets three or more execution threads. The proposed approaches increase the total computational complexity of the algorithm relative to a single execution thread, but decrease the total execution time on a multi-core CPU
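The delayed carry idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 32-bit limb width, the column-wise accumulation order, and the names `to_limbs`, `column_sums`, `assimilate`, and `multiply` are all assumptions made for illustration. Python integers are unbounded, so the sketch hides the wider accumulators a real 32-bit or 64-bit implementation would need, and it does not start any threads; it only shows that each column sum is computed without reference to any other column, with all carries resolved in one final pass.

```python
W = 32                      # assumed limb width in bits
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split a non-negative integer into n little-endian W-bit limbs."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine little-endian W-bit limbs into an integer."""
    return sum(limb << (W * i) for i, limb in enumerate(limbs))

def column_sums(a, b):
    """Accumulate partial products per column with no carry propagation.

    Each column k depends only on the inputs, never on another column,
    so the outer loop could be split across threads.
    """
    n, m = len(a), len(b)
    cols = [0] * (n + m)
    for k in range(n + m - 1):
        acc = 0
        for i in range(max(0, k - m + 1), min(n, k + 1)):
            acc += a[i] * b[k - i]
        cols[k] = acc       # may exceed W bits: the carry is delayed
    return cols

def assimilate(cols):
    """Single final pass that pushes the delayed carries upward."""
    out, carry = [], 0
    for c in cols:
        t = c + carry
        out.append(t & MASK)
        carry = t >> W
    return out

def multiply(x, y, n):
    """Multiply two integers that each fit in n limbs."""
    return from_limbs(assimilate(column_sums(to_limbs(x, n), to_limbs(y, n))))
```

For operands that fit in `n` limbs, `multiply(x, y, n)` should agree with the built-in `x * y`; the sequential part of the work is confined to `assimilate`.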
Approaches to Improving the Performance of a Software Implementation of Multiplication in the Field of Integers
The authors propose an approach to increasing the performance of a software implementation of the multiplication algorithm in a number field for 32-bit and 64-bit platforms. It consists in delaying the handling of the carry out of the most significant bit during sum accumulation, which avoids the need to account for this carry at each iteration of the sum accumulation loop. The delayed carry makes it possible to reduce the total number of addition operations and to apply existing parallelization technologies effectively
Approaches for the performance increasing of software implementation of integer multiplication in prime fields
The authors propose an approach to increasing the performance of a software implementation of the finite field multiplication algorithm on 32-bit and 64-bit platforms. The approach is based on a delayed carry mechanism for the significant bit during sum accumulation, which avoids having to account for the significant-bit carry at each iteration of the sum accumulation loop. The delayed carry mechanism reduces the total number of additions and makes it possible to apply modern parallelization technologies
Performance Increasing Approaches For Binary Field Inversion
The authors propose several approaches to increasing the performance of a multiplicative inversion algorithm in binary fields based on the Extended Euclidean Algorithm (EEA). The first approach exploits a specific property of the EEA: the invariant polynomial u either remains intact or is swapped with the invariant polynomial v, which makes it unnecessary to compute the degree of v. The second approach concerns the search for the next matching index when calculating the degree of a polynomial: since the degree of the invariant polynomial u decreases by at least 1, the current value can be reused in the subsequent degree calculation
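The two ideas in this abstract can be sketched on the textbook EEA-based inversion for binary fields. This is an illustrative sketch, not the authors' code: polynomials over GF(2) are stored as Python integers (bit i is the coefficient of x^i), and the names `degree_from`, `gf2_inverse`, and `gf2_mulmod` are hypothetical. The degree of v is never recomputed (it is carried over on a swap), and the degree of u is found by scanning downward from one below its previous value.

```python
def degree_from(p, start):
    """Scan downward from bit index `start` for the top set bit of p.

    Returns -1 if no bit at or below `start` is set.
    """
    d = start
    while d >= 0 and not (p >> d) & 1:
        d -= 1
    return d

def gf2_inverse(a, f):
    """Inverse of a modulo the irreducible polynomial f over GF(2)."""
    if a == 0:
        raise ValueError("zero has no multiplicative inverse")
    u, v = a, f
    g1, g2 = 1, 0
    du, dv = a.bit_length() - 1, f.bit_length() - 1
    while u != 1:
        j = du - dv
        if j < 0:
            # u and v trade places, so their known degrees trade too;
            # the degree of v is reused rather than recomputed.
            u, v = v, u
            g1, g2 = g2, g1
            du, dv = dv, du
            j = -j
        u ^= v << j          # cancels the leading term of u
        g1 ^= g2 << j
        # The degree of u dropped by at least 1: resume the scan there.
        du = degree_from(u, du - 1)
    return g1

def gf2_mulmod(a, b, f):
    """Product of reduced polynomials a and b modulo f (used for checking)."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r
```

A result `inv = gf2_inverse(a, f)` can be checked with `gf2_mulmod(a, inv, f) == 1`, for example with the degree-8 polynomial `0x11B` as `f`.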
Techniques for Performance Improvement of Integer Multiplication in Cryptographic Applications
The performance of arithmetic operations in number fields is an active research area, as evidenced by a substantial body of publications. In this work, we offer some techniques to increase the performance of a software implementation of the finite field multiplication algorithm on both 32-bit and 64-bit platforms. The developed technique, called the “delayed carry mechanism,” avoids having to consider a significant-bit carry at each iteration of the sum accumulation loop. This mechanism reduces the total number of additions and allows modern parallelization technologies to be applied effectively
Vectorizing and distributing number-theoretic transform to count Goldbach partitions on Arm-based supercomputers
In this article, we explore the usage of scalable vector extension (SVE) to vectorize number-theoretic transforms (NTTs). In particular, we show that 64-bit modular arithmetic operations, including modular multiplication, can be efficiently implemented with SVE instructions. The vectorization of NTT loops and kernels involving 64-bit modular operations was not possible in previous Arm-based single instruction multiple data architectures since these architectures lacked crucial instructions to efficiently implement modular multiplication. We test and evaluate our SVE implementation on the A64FX processor in an HPE Apollo 80 system. Furthermore, we implement a distributed NTT for the computation of large-scale exact integer convolutions. We evaluate this transform on HPE Apollo 70, Cray XC50, HPE Apollo 80, and HPE Cray EX systems, where we demonstrate good scalability to thousands of cores. Finally, we describe how these methods can be utilized to count the number of Goldbach partitions of all even numbers to large limits. We present some preliminary results concerning this problem, in particular a histogram of the number of Goldbach partitions of the even numbers up to 2^40.
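As a small scalar illustration of an exact integer convolution computed through an NTT, the sketch below uses the prime 998244353 with primitive root 3. Both are assumptions chosen for convenience, not the parameters of the article, and nothing here is vectorized or distributed. The result is exact only while every true convolution coefficient is smaller than the modulus; the names `ntt` and `convolve` are hypothetical.

```python
P = 998244353   # assumed NTT-friendly prime, 119 * 2**23 + 1
G = 3           # a primitive root modulo P

def ntt(values, invert=False):
    """Iterative radix-2 NTT modulo P; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes with modular multiplication by twiddle factors.
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                x = a[k]
                y = a[k + half] * w % P
                a[k] = (x + y) % P
                a[k + half] = (x - y) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a

def convolve(x, y):
    """Convolution of two integer sequences, exact if all results are < P."""
    size = len(x) + len(y) - 1
    n = 1
    while n < size:
        n <<= 1
    fx = ntt(list(x) + [0] * (n - len(x)))
    fy = ntt(list(y) + [0] * (n - len(y)))
    prod = [u * v % P for u, v in zip(fx, fy)]
    return ntt(prod, invert=True)[:size]
```

Convolving a 0/1 prime indicator sequence with itself gives, at index n, the number of ordered pairs of primes summing to n, which is the kind of quantity behind a Goldbach partition count.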
Modular SIMD arithmetic in Mathemagix
Modular integer arithmetic occurs in many algorithms for computer algebra,
cryptography, and error correcting codes. Although recent microprocessors
typically offer a wide range of highly optimized arithmetic functions, modular
integer operations still require dedicated implementations. In this article, we
survey existing algorithms for modular integer arithmetic, and present detailed
vectorized counterparts. We also present several applications, such as fast
modular Fourier transforms and multiplication of integer polynomials and
matrices. The vectorized algorithms have been implemented in C++ inside the
free computer algebra and analysis system Mathemagix. The performance of our
implementation is illustrated by various benchmarks
Comparison of Modular Arithmetic Algorithms on GPUs
We present our first implementation results for a modular arithmetic library on GPUs for cryptography. Our library, written in C++ for CUDA, provides modular arithmetic, finite field arithmetic, and some ECC support. Several algorithms and memory coding styles have been compared: local, shared, and register. For moderate sizes, we report a speedup of up to 2.6 compared to a state-of-the-art library