13 research outputs found
Faster arithmetic for number-theoretic transforms
We show how to improve the efficiency of the computation of fast Fourier
transforms over F_p where p is a word-sized prime. Our main technique is
optimisation of the basic arithmetic, in effect decreasing the total number of
reductions modulo p, by making use of a redundant representation for integers
modulo p. We give performance results showing a significant improvement over
Shoup's NTL library. Comment: 9 pages, a few minor changes and reorganisation, to appear in JS
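The redundant-representation idea in this abstract can be illustrated with a small sketch. This is not the paper's algorithm or code: the prime P, the generator G, and the function names are assumptions chosen for illustration, and Python integers are not word-sized. The sketch only shows the invariant that butterfly outputs may stay in [0, 2p) rather than [0, p), with a single normalisation pass at the end.

```python
# Illustrative sketch only: radix-2 NTT whose butterflies keep values in the
# redundant range [0, 2P) and normalise to [0, P) once, at the very end.
P = 998244353  # assumed prime, P - 1 = 2^23 * 7 * 17
G = 3          # assumed primitive root modulo P


def ntt_lazy(a):
    """Forward NTT of a (length a power of two), lazily reduced."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = [x % P for x in a]

    # Bit-reversal permutation (decimation in time).
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]

    length = 2
    while length <= n:
        half = length // 2
        w_len = pow(G, (P - 1) // length, P)
        for start in range(0, n, length):
            w = 1
            for k in range(half):
                x = a[start + k]                # in [0, 2P)
                t = a[start + k + half] * w % P  # in [0, P)
                u = x + t                        # in [0, 3P)
                if u >= 2 * P:
                    u -= 2 * P
                v = x - t + P                    # in (0, 3P)
                if v >= 2 * P:
                    v -= 2 * P
                assert 0 <= u < 2 * P and 0 <= v < 2 * P
                a[start + k] = u
                a[start + k + half] = v
                w = w * w_len % P
        length *= 2

    # Single final normalisation from [0, 2P) to [0, P).
    return [x - P if x >= P else x for x in a]


def naive_dft(a):
    """Quadratic-time reference transform used to check the sketch."""
    n = len(a)
    w = pow(G, (P - 1) // n, P)
    return [sum(a[j] * pow(w, j * k, P) for j in range(n)) % P
            for k in range(n)]
```

Calling ntt_lazy([1, 2, 3, 4, 5, 6, 7, 8]) should agree with naive_dft on the same input. In a word-sized implementation the saving would come from skipping corrections that the redundant range makes unnecessary; this sketch does not measure that.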
Efficient long division via Montgomery multiply
We present a novel right-to-left long division algorithm based on the
Montgomery modular multiply, consisting of separate highly efficient loops with
simple carry structure for computing first the remainder (x mod q) and then the
quotient floor(x/q). These loops are ideally suited for the case where x
occupies many more machine words than the divide modulus q, and are strictly
linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic
performance test of multiword dividend and single 64-bit-word divisor,
exploitation of the inherent data-parallelism of the algorithm effectively
mitigates the long latency of hardware integer MUL operations, as a result of
which we are able to achieve respective costs for remainder-only and full-DIV
(remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel
Core 2 implementation of the x86_64 architecture, in single-threaded execution
mode. We further describe a simple "bit-doubling modular inversion" scheme,
which allows the entire iterative computation of the mod-inverse required by
the Montgomery multiply at arbitrarily large precision to be performed with
cost less than that of a single Newtonian iteration performed at the full
precision of the final result. We also show how the Montgomery-multiply-based
powering can be efficiently used in Mersenne and Fermat-number trial
factorization via direct computation of a modular inverse power of 2, without
any need for explicit radix-mod scalings. Comment: 23 pages; 8 tables v2: Tweak formatting, pagecount -= 2. v3: Fix
incorrect powers of R in formulae [7] and [11] v4: Add Eldridge & Walter ref.
v5: Clarify relation between Algos A/A',D and Hensel-div; clarify
true-quotient mechanics; Add Haswell timings, refs to Agner Fog timings pdf
and GMP asm-timings ref-page. v6: Remove stray +bw in MULL line of Algo D
listing; add note re byte-LUT for qinv_
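The "bit-doubling modular inversion" mentioned in this abstract can be sketched with the standard Newton (Hensel) iteration for the inverse of an odd q modulo a power of two. This is a minimal sketch, not the paper's listing: the function name inv_mod_pow2 and its parameters are hypothetical. Each step is truncated to the precision it has actually reached, which is the reason the early steps are cheap.

```python
def inv_mod_pow2(q, bits):
    """Return x with q * x == 1 (mod 2**bits), for odd q.

    Hypothetical helper: Newton iteration x <- x * (2 - q * x), where each
    step doubles the number of correct low-order bits.
    """
    if q % 2 == 0:
        raise ValueError("q must be odd to be invertible modulo a power of 2")
    x = q        # q * q == 1 (mod 8) for every odd q: 3 correct bits
    good = 3     # number of low-order bits of x known to be correct
    while good < bits:
        good = min(2 * good, bits)
        mask = (1 << good) - 1
        # Work only at the precision reached so far.
        x = (x * (2 - q * x)) & mask
    return x & ((1 << bits) - 1)
```

For a 64-bit word, inv_mod_pow2(q, 64) takes five doubling steps (3, 6, 12, 24, 48, 64 bits).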
NFLlib: NTT-based Fast Lattice Library
Recent years have witnessed an increased interest in lattice cryptography. Besides its strong security guarantees, its simplicity and versatility make this powerful theoretical tool a promising competitive alternative to classical cryptographic schemes. In this paper, we introduce NFLlib, an efficient and open-source C++ library dedicated to ideal lattice cryptography in the widespread polynomial ring Z_p[x]/(x^n + 1) for n a power of 2. The library combines algorithmic optimizations (Chinese Remainder Theorem, optimized Number Theoretic Transform) with programming optimization techniques (SSE and AVX2 specializations, C++ expression templates, etc.), and will be fully available under the GPL license. The library compares very favorably to other libraries used in ideal lattice cryptography implementations (namely the generic number theory libraries NTL and flint, which implement polynomial arithmetic, and HElib, the optimized library for lattice homomorphic encryption): restricting the library to the aforementioned polynomial ring yields gains of several orders of magnitude in efficiency
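For readers unfamiliar with the ring Z_p[x]/(x^n + 1) named in this abstract, the following sketch shows its product by the schoolbook method. It does not use or reproduce NFLlib's API or its NTT-based algorithm; the function name negacyclic_mul is hypothetical. It only illustrates that x^n wraps around to -1.

```python
def negacyclic_mul(a, b, p):
    """Multiply two polynomials in Z_p[x]/(x^n + 1), schoolbook style.

    a and b are coefficient lists of equal length n (lowest degree first).
    Hypothetical helper for illustration; a real library would use an NTT.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("operands must have the same length")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % p
            else:
                # x^n == -1 in this ring, so the wrapped term changes sign.
                c[k - n] = (c[k - n] - ai * bj) % p
    return c
```

With n = 2 and p = 17, negacyclic_mul([0, 1], [0, 1], 17) returns [16, 0], since x * x = x^2 = -1.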
High performance SIMD modular arithmetic for polynomial evaluation
Two essential problems in Computer Algebra, namely polynomial factorization
and polynomial greatest common divisor computation, can be efficiently solved
thanks to multiple polynomial evaluations in two variables using modular
arithmetic. In this article, we focus on the efficient computation of such
polynomial evaluations on one single CPU core. We first show how to leverage
SIMD computing for modular arithmetic on AVX2 and AVX-512 units, using both
intrinsics and OpenMP compiler directives. Then we manage to increase the
operational intensity and to exploit instruction-level parallelism in order to
increase the compute efficiency of these polynomial evaluations. Together,
these techniques yield performance gains of up to about 5x on AVX2 and 10x on
AVX-512
Guess & Sketch: Language Model Guided Transpilation
Maintaining legacy software requires many software and systems engineering
hours. Assembly code programs, which demand low-level control over the computer
machine state and have no variable names, are particularly difficult for humans
to analyze. Existing conventional program translators guarantee correctness,
but are hand-engineered for the source and target programming languages in
question. Learned transpilation, i.e. automatic translation of code, offers an
alternative to manual re-writing and engineering efforts. Automated symbolic
program translation approaches guarantee correctness but struggle to scale to
longer programs due to the exponentially large search space. Their rigid
rule-based systems also limit their expressivity, so they can only reason about
a reduced space of programs. Probabilistic neural language models (LMs) produce
plausible outputs for every input, but do so at the cost of guaranteed
correctness. In this work, we leverage the strengths of LMs and symbolic
solvers in a neurosymbolic approach to learned transpilation for assembly code.
Assembly code is an appropriate setting for a neurosymbolic approach, since
assembly code can be divided into shorter non-branching basic blocks amenable
to the use of symbolic methods. Guess & Sketch extracts alignment and
confidence information from features of the LM then passes it to a symbolic
solver to resolve semantic equivalence of the transpilation input and output.
We test Guess & Sketch on three different test sets of assembly transpilation
tasks, varying in difficulty, and show that it successfully transpiles 57.6%
more examples than GPT-4 and 39.6% more examples than an engineered transpiler.
We also share a training and evaluation dataset for this task
Optimized Binary64 and Binary128 Arithmetic with GNU MPFR
We describe algorithms used to optimize the GNU MPFR library when the operands fit into one or two words. On modern processors, a correctly rounded addition of two quadruple precision numbers is now performed in 22 cycles, a subtraction in 24 cycles, a multiplication in 32 cycles, a division in 64 cycles, and a square root in 69 cycles. We also introduce a new faithful rounding mode, which enables even faster computations. Those optimizations will be available in version 4 of MPFR
Division algorithm using a number system with base RADIX16 for generating quotient digits
Relevance. Every day we all use computers, smartphones, and other devices built around a central microprocessor. It performs a vast number of operations every second. We do not even notice it, but some operations take more time than others. Speeding them up by a few percent would not be noticeable to an ordinary user, but across humanity as a whole it could save a significant amount of time. One such operation is arithmetic division. Unfortunately, apart from the developers of microprocessor architectures themselves, no one knows exactly which algorithm modern microprocessors use for division. Developing a publicly available algorithm is therefore a relevant task. Research goal. The goal of the master's dissertation is to develop a division algorithm using a number system with base RADIX16 that could be used to build the core of a modern microprocessor. Object of research: arithmetic operations on numbers in modern microprocessors. Subject of research: the algorithm for finding quotient digits in modern microprocessors. Research methods. The master's thesis uses methods of algorithm analysis and design. The research makes it possible to use the developed algorithm to obtain quotient digits (that is, to perform division). Practical value. The results can be used in future research in the following directions: improving division algorithms; working with numbers in the number system with base RADIX16
Algorithm Optimization Using SIMD Instructions
This thesis describes and compares techniques for optimizing algorithms, mainly with respect to run time. To demonstrate these techniques, algorithms from different fields were chosen: particle swarm optimization, a circle drawing algorithm, and an image (matrix) rotation algorithm. These algorithms were implemented in Python 3, C, and assembly language using SIMD instructions, with emphasis on making each implementation as efficient as possible. The thesis describes and compares these techniques and their effect on algorithm optimization. The tests confirmed the large potential of SIMD technology for optimization, but also showed that this approach cannot be applied to every algorithm. For circle drawing, the SIMD implementation was more than ten times faster than the serial C implementation and more than one thousand times faster than the Python 3 implementation. For particle swarm optimization the result was the opposite: the serial C implementation was faster than the SIMD implementation.
Computer Science for Continuous Data: Survey, Vision, Theory, and Practice of a Computer Analysis System
Building on George Boole's work, Logic provides a rigorous foundation for the powerful tools in Computer Science that underlie today's ubiquitous processing of discrete data, such as strings or graphs. Concerning continuous data, Alan Turing had already applied "his" machines to formalize and study the processing of real numbers: an aspect of his oeuvre that we transform from theory to practice. The present essay surveys the state of the art and envisions the future of Computer Science for continuous data: natively, beyond brute-force discretization, based on, guided by, and extending classical discrete Computer Science, as a bridge between Pure and Applied Mathematics