13 research outputs found

    Faster arithmetic for number-theoretic transforms

    We show how to improve the efficiency of the computation of fast Fourier transforms over F_p where p is a word-sized prime. Our main technique is optimisation of the basic arithmetic, in effect decreasing the total number of reductions modulo p, by making use of a redundant representation for integers modulo p. We give performance results showing a significant improvement over Shoup's NTL library. Comment: 9 pages, a few minor changes and reorganisation, to appear in JS
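A minimal sketch of what a redundant representation can look like in one butterfly, assuming a 64-bit word and a prime p below 2^62. Values are kept in [0, 2p) rather than [0, p), so the final conditional subtraction after the multiplication is skipped. The function names and the precomputed-quotient multiplication are illustrative assumptions, not code taken from the paper or from NTL.

```python
BETA = 1 << 64  # assumed machine word size, for illustration only


def shoup_precompute(w, p):
    """Precompute floor(w * BETA / p) for a fixed multiplier w in [0, p)."""
    return (w * BETA) // p


def lazy_butterfly(x, y, w, w_shoup, p):
    """One butterfly on residues held in the redundant range [0, 2p).

    Returns (s, d) with s = x + y (mod p) and d = w * (x - y) (mod p),
    both again in [0, 2p). Requires p < BETA / 4 so that t fits in a word.
    """
    s = x + y
    if s >= 2 * p:          # single correction keeps s in [0, 2p)
        s -= 2 * p
    t = x - y + 2 * p       # in (0, 4p): non-negative without reducing
    q = (w_shoup * t) >> 64  # approximate quotient of w * t by p
    d = (w * t - q * p) % BETA  # lies in [0, 2p); no final "if d >= p"
    return s, d
```

Only when a canonical result is needed, for example at the end of a transform, would each value be brought back to [0, p) with at most one subtraction of p.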

    Efficient long division via Montgomery multiply

    We present a novel right-to-left long division algorithm based on the Montgomery modular multiply, consisting of separate highly efficient loops with simple carry structure for computing first the remainder (x mod q) and then the quotient floor(x/q). These loops are ideally suited for the case where x occupies many more machine words than the divide modulus q, and are strictly linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic performance test of multiword dividend and single 64-bit-word divisor, exploitation of the inherent data-parallelism of the algorithm effectively mitigates the long latency of hardware integer MUL operations, as a result of which we are able to achieve respective costs for remainder-only and full-DIV (remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel Core 2 implementation of the x86_64 architecture, in single-threaded execution mode. We further describe a simple "bit-doubling modular inversion" scheme, which allows the entire iterative computation of the mod-inverse required by the Montgomery multiply at arbitrarily large precision to be performed with cost less than that of a single Newtonian iteration performed at the full precision of the final result. We also show how the Montgomery-multiply-based powering can be efficiently used in Mersenne and Fermat-number trial factorization via direct computation of a modular inverse power of 2, without any need for explicit radix-mod scalings. Comment: 23 pages; 8 tables. v2: Tweak formatting, pagecount -= 2. v3: Fix incorrect powers of R in formulae [7] and [11]. v4: Add Eldridge & Walter ref. v5: Clarify relation between Algos A/A',D and Hensel-div; clarify true-quotient mechanics; Add Haswell timings, refs to Agner Fog timings pdf and GMP asm-timings ref-page. v6: Remove stray +bw in MULL line of Algo D listing; add note re byte-LUT for qinv_
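The mod-inverse that a Montgomery multiply needs can be illustrated with the plain Newton (Hensel-lifting) iteration, in which each step doubles the number of correct low-order bits. This is a sketch of that textbook iteration only, assuming an odd modulus q; it is not the paper's optimized scheme, and the function name is hypothetical.

```python
def inv_mod_pow2(q, bits=64):
    """Inverse of an odd integer q modulo 2**bits by Newton iteration."""
    if q % 2 == 0:
        raise ValueError("q must be odd to be invertible modulo a power of 2")
    mask = (1 << bits) - 1
    x = q & mask   # q * q = 1 (mod 8) for odd q, so x is correct to 3 bits
    correct = 3
    while correct < bits:
        # If q * x = 1 (mod 2**k), then q * x * (2 - q * x) = 1 (mod 2**(2k)).
        x = (x * (2 - q * x)) & mask
        correct *= 2
    return x
```

For a 64-bit word the loop runs five times (3, 6, 12, 24, 48, 96 correct bits).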

    NFLlib: NTT-based Fast Lattice Library

    Recent years have witnessed an increased interest in lattice cryptography. Besides its strong security guarantees, its simplicity and versatility make this powerful theoretical tool a promising competitive alternative to classical cryptographic schemes. In this paper, we introduce NFLlib, an efficient and open-source C++ library dedicated to ideal lattice cryptography in the widely used polynomial ring Z_p[x]/(x^n + 1) for n a power of 2. The library combines algorithmic optimizations (Chinese Remainder Theorem, optimized Number Theoretic Transform) together with programming optimization techniques (SSE and AVX2 specializations, C++ expression templates, etc.), and will be fully available under the GPL license. The library compares very favorably to other libraries used in ideal lattice cryptography implementations (namely the generic number theory libraries NTL and flint implementing polynomial arithmetic, and the optimized library for lattice homomorphic encryption HElib): restricting the library to the aforementioned polynomial ring yields gains of several orders of magnitude in efficiency.
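As a rough illustration of how a number-theoretic transform multiplies polynomials in Z_p[x]/(x^n + 1), the sketch below twists the inputs by powers of psi, a primitive 2n-th root of unity modulo p, so that a cyclic transform produces the negacyclic product. It uses a quadratic-time transform for clarity, needs Python 3.8 or later for `pow(x, -1, p)`, and none of the names correspond to NFLlib's actual API.

```python
def ntt(a, omega, p):
    """Naive O(n^2) transform: evaluate a at the powers of omega modulo p."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p
            for i in range(n)]


def negacyclic_mul(a, b, psi, p):
    """Multiply a and b in Z_p[x]/(x^n + 1); psi must satisfy psi**n = -1 mod p."""
    n = len(a)
    omega = psi * psi % p                 # primitive n-th root of unity
    at = [a[i] * pow(psi, i, p) % p for i in range(n)]   # twist inputs
    bt = [b[i] * pow(psi, i, p) % p for i in range(n)]
    fc = [x * y % p for x, y in zip(ntt(at, omega, p), ntt(bt, omega, p))]
    ct = ntt(fc, pow(omega, -1, p), p)    # inverse transform, unscaled
    n_inv, psi_inv = pow(n, -1, p), pow(psi, -1, p)
    return [ct[i] * n_inv * pow(psi_inv, i, p) % p for i in range(n)]  # untwist


def schoolbook_negacyclic(a, b, p):
    """Reference product: terms that wrap past x^n pick up a minus sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % p
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % p
    return c
```

With p = 17 and n = 4, psi = 2 is a valid choice because 2^4 = 16 = -1 modulo 17.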

    High performance SIMD modular arithmetic for polynomial evaluation

    Two essential problems in Computer Algebra, namely polynomial factorization and polynomial greatest common divisor computation, can be efficiently solved thanks to multiple polynomial evaluations in two variables using modular arithmetic. In this article, we focus on the efficient computation of such polynomial evaluations on a single CPU core. We first show how to leverage SIMD computing for modular arithmetic on AVX2 and AVX-512 units, using both intrinsics and OpenMP compiler directives. We then increase the operational intensity and exploit instruction-level parallelism to improve the compute efficiency of these polynomial evaluations. Together, these techniques yield performance gains of up to about 5x on AVX2 and 10x on AVX-512.

    Guess & Sketch: Language Model Guided Transpilation

    Maintaining legacy software requires many software and systems engineering hours. Assembly code programs, which demand low-level control over the computer machine state and have no variable names, are particularly difficult for humans to analyze. Existing conventional program translators guarantee correctness, but are hand-engineered for the source and target programming languages in question. Learned transpilation, i.e. automatic translation of code, offers an alternative to manual re-writing and engineering efforts. Automated symbolic program translation approaches guarantee correctness but struggle to scale to longer programs due to the exponentially large search space. Their rigid rule-based systems also limit their expressivity, so they can only reason about a reduced space of programs. Probabilistic neural language models (LMs) produce plausible outputs for every input, but do so at the cost of guaranteed correctness. In this work, we leverage the strengths of LMs and symbolic solvers in a neurosymbolic approach to learned transpilation for assembly code. Assembly code is an appropriate setting for a neurosymbolic approach, since assembly code can be divided into shorter non-branching basic blocks amenable to the use of symbolic methods. Guess & Sketch extracts alignment and confidence information from features of the LM then passes it to a symbolic solver to resolve semantic equivalence of the transpilation input and output. We test Guess & Sketch on three different test sets of assembly transpilation tasks, varying in difficulty, and show that it successfully transpiles 57.6% more examples than GPT-4 and 39.6% more examples than an engineered transpiler. We also share a training and evaluation dataset for this task

    Optimized Binary64 and Binary128 Arithmetic with GNU MPFR

    We describe algorithms used to optimize the GNU MPFR library when the operands fit into one or two words. On modern processors, a correctly rounded addition of two quadruple precision numbers is now performed in 22 cycles, a subtraction in 24 cycles, a multiplication in 32 cycles, a division in 64 cycles, and a square root in 69 cycles. We also introduce a new faithful rounding mode, which enables even faster computations. Those optimizations will be available in version 4 of MPFR.

    Division algorithm using the RADIX16 (base-16) number system to generate quotient digits

    Relevance. Every day we all use computers, smartphones, and other devices built around a central microprocessor. It performs a vast number of operations every second. We do not even notice this, but some operations take longer than others. Speeding them up by a few percent would not be noticeable to an ordinary user, but across humanity as a whole it could save a significant amount of time. One such operation is arithmetic division. Unfortunately, apart from the developers of microprocessor architectures themselves, no one knows exactly which algorithm performs division in modern microprocessors. Developing a publicly available algorithm is therefore a relevant task. Research aim. The aim of the master's dissertation is to develop a division algorithm using the RADIX16 (base-16) number system that could be used to build the core of a modern microprocessor. Object of research: arithmetic operations on numbers in modern microprocessors. Subject of research: the algorithm for determining quotient digits in modern microprocessors. Research methods. The master's thesis uses methods of algorithm analysis and design. The research makes it possible to use the developed algorithm to obtain quotient digits (that is, to perform division). Practical value. The results can be used in future research in two directions: improving division algorithms, and working with numbers in the RADIX16 number system.

    Algorithm Optimization Using SIMD Instructions

    This thesis describes and compares techniques for optimizing algorithms, chiefly with respect to reducing run time. To demonstrate these techniques, algorithms from different fields were chosen: particle swarm optimization, a circle-drawing algorithm, and an image (matrix) rotation algorithm. These algorithms were implemented in Python 3, C, and assembly language using SIMD instructions, with emphasis on making each implementation as efficient as possible. The thesis describes and compares these practices and their effect on algorithm optimization. The tests confirmed the great potential of SIMD technology for optimization, but also showed that this approach cannot be applied to every algorithm. For circle drawing, the SIMD implementation was more than ten times faster than the serial C implementation and more than one thousand times faster than the Python 3 implementation. For particle swarm optimization the result was the opposite: the serial C implementation was faster than the SIMD implementation.

    Computer Science for Continuous Data: Survey, Vision, Theory, and Practice of a Computer Analysis System

    Building on George Boole's work, Logic provides a rigorous foundation for the powerful tools in Computer Science that underlie today's ubiquitous processing of discrete data, such as strings or graphs. Concerning continuous data, Alan Turing had already applied "his" machines to formalize and study the processing of real numbers: an aspect of his oeuvre that we transform from theory to practice. The present essay surveys the state of the art and envisions the future of Computer Science for continuous data: natively, beyond brute-force discretization, based on and guided by and extending classical discrete Computer Science, as a bridge between Pure and Applied Mathematics.