
    Modern Computer Arithmetic (version 0.5.1)

    This is a draft of a book about algorithms for performing arithmetic, and their implementation on modern computers. We are concerned with software more than hardware - we do not cover computer architecture or the design of computer hardware. Instead we focus on algorithms for efficiently performing arithmetic operations such as addition, multiplication and division, and their connections to topics such as modular arithmetic, greatest common divisors, the Fast Fourier Transform (FFT), and the computation of elementary and special functions. The algorithms that we present are mainly intended for arbitrary-precision arithmetic. They are not limited by the computer word size, only by the memory and time available for the computation. We consider both integer and real (floating-point) computations. The book is divided into four main chapters, plus an appendix. Our aim is to present the latest developments in a concise manner. At the same time, we provide a self-contained introduction for the reader who is not an expert in the field, and exercises at the end of each chapter. Chapter titles are: 1, Integer Arithmetic; 2, Modular Arithmetic and the FFT; 3, Floating-Point Arithmetic; 4, Elementary and Special Function Evaluation; 5 (Appendix), Implementations and Pointers. The book also contains a bibliography of 236 entries, index, summary of notation, and summary of complexities. Comment: Preliminary version of a book to be published by Cambridge University Press. xvi+247 pages. Cite as "Modern Computer Arithmetic, Version 0.5.1, 5 March 2010". For further details, updates and errata see http://wwwmaths.anu.edu.au/~brent/pub/pub226.html or http://www.loria.fr/~zimmerma/mca/pub226.htm
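
    As a minimal illustration of the arbitrary-precision viewpoint described above (a sketch, not an algorithm taken from the book), the code below adds two integers stored as arrays of word-sized limbs with explicit carry propagation; the 64-bit limb size is an assumption chosen for the example.

```python
# A minimal sketch of multi-limb addition, assuming 64-bit limbs stored
# least-significant first.  Precision is limited only by available memory.
BASE = 1 << 64

def add_mp(a, b):
    """Return the limb array representing a + b, propagating carries."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

# (2^128 - 1) + 1 = 2^128: the carry propagates across both limbs.
print(add_mp([BASE - 1, BASE - 1], [1]))   # [0, 0, 1]
```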

    Development of symbolic algorithms for certain algebraic processes

    This study investigates the problem of computing the exact greatest common divisor of two polynomials relative to an orthogonal basis, defined over the rational number field. The main objective of the study is to design and implement an effective and efficient symbolic algorithm for the general class of dense polynomials, given the rational-number defining terms of their basis. From a general algorithm using the comrade matrix approach, nonmodular and modular techniques are described. If the coefficients of the generalized polynomials are multiprecision integers, multiprecision arithmetic is required in the construction of the comrade matrix and of the corresponding system's coefficient matrix. In addition, applying the nonmodular elimination technique to this coefficient matrix makes extensive use of multiprecision rational number operations. The modular technique is employed to minimize the complexity of such computations. A divisor test algorithm that enables the detection of an unlucky reduction is a crucial device for an effective implementation of the modular technique. Since the bound on the true solution is not known a priori, such a test is devised and carefully incorporated into the modular algorithm. The results show that the modular algorithm performs best for the class of relatively prime polynomials. The empirical computing-time results show that the modular algorithm is markedly superior to the nonmodular algorithms for sufficiently dense Legendre basis polynomials with a small GCD solution. For dense Legendre basis polynomials with a large GCD solution, the modular algorithm is significantly superior to the nonmodular algorithms at higher polynomial degrees. For more definitive conclusions, the computing-time functions of the algorithms presented in this report have been worked out. Further investigations are also suggested.
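
    The sketch below illustrates only the underlying modular principle, in the ordinary power basis rather than the comrade-matrix setting of this study: GCD images are computed modulo several primes, and a prime whose image has larger than minimal degree is treated as unlucky and discarded. The example polynomials and primes are illustrative assumptions.

```python
# Sketch of the modular idea: GCD images modulo several primes, with
# unlucky primes detected by a larger-than-minimal image degree.
# Polynomials are coefficient lists, lowest degree first.

def poly_gcd_mod(a, b, p):
    """Monic GCD of a and b over GF(p)."""
    def trim(f):
        while f and f[-1] % p == 0:
            f.pop()
        return f
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        while len(a) >= len(b):            # remainder of a divided by b
            q = a[-1] * pow(b[-1], -1, p) % p
            off = len(a) - len(b)
            for i, c in enumerate(b):
                a[off + i] = (a[off + i] - q * c) % p
            a = trim(a)
            if not a:
                break
        a, b = b, a
    lead_inv = pow(a[-1], -1, p)
    return [c * lead_inv % p for c in a]

# gcd(x - 1, x + 1) = 1 over Q, but p = 2 is unlucky: both reduce to x + 1.
f, g = [-1, 1], [1, 1]
images = {p: poly_gcd_mod(list(f), list(g), p) for p in (2, 5, 7)}
min_len = min(len(im) for im in images.values())
print({p: im for p, im in images.items() if len(im) == min_len})  # drops p = 2
```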

    Short addition sequences for theta functions

    The main step in the numerical evaluation of classical SL2(Z) modular forms and elliptic functions is to compute the sum of the first N nonzero terms in the sparse q-series belonging to the Dedekind eta function or the Jacobi theta constants. We construct short addition sequences to perform this task using N + o(N) multiplications. Our constructions rely on the representability of specific quadratic progressions of integers as sums of smaller numbers of the same kind. For example, we show that every generalised pentagonal number c ≥ 5 can be written as c = 2a + b where a, b are smaller generalised pentagonal numbers. We also give a baby-step giant-step algorithm that uses O(N/log^r N) multiplications for any r > 0, beating the lower bound of N multiplications required when computing the terms explicitly. These results lead to speed-ups in practice.
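
    As a quick numerical illustration of the decomposition property quoted above (a check, not the paper's proof), the sketch below enumerates generalised pentagonal numbers and verifies that every such c ≥ 5 can be written as 2a + b with a, b smaller generalised pentagonal numbers; the bound 2000 is an arbitrary choice.

```python
# Check that every generalised pentagonal number c >= 5 up to a bound can
# be written as c = 2a + b with a, b smaller generalised pentagonal numbers.

def gen_pentagonal(limit):
    """Generalised pentagonal numbers k(3k - 1)/2 for k = 0, 1, -1, 2, -2, ..."""
    nums, k = {0}, 1
    while True:
        for j in (k, -k):
            v = j * (3 * j - 1) // 2
            if v > limit:
                return sorted(nums)
            nums.add(v)
        k += 1

pent = gen_pentagonal(2000)
pent_set = set(pent)
for c in pent:
    if c >= 5:
        assert any((c - 2 * a) in pent_set for a in pent if 0 < a and 2 * a < c), c
print("c = 2a + b verified for all generalised pentagonal numbers up to", pent[-1])
```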

    Bit Serial Systolic Architectures for Multiplicative Inversion and Division over GF(2^m)

    Systolic architectures are capable of achieving high throughput by maximizing pipelining and by eliminating global data interconnects. Recursive algorithms with regular data flows are suitable for systolization. The computation of multiplicative inversion using algorithms based on the EEA (extended Euclidean algorithm) is particularly suitable for systolization. Implementations based on the EEA present a high degree of parallelism and pipelinability at the bit level, which can be easily optimized to achieve local data flow and to eliminate the global interconnects that represent the most important bottleneck in today's sub-micron design process. The net result is a high clock rate and high performance based on efficient systolic architectures. This thesis examines high-performance but also scalable implementations of multiplicative inversion or field division over Galois fields GF(2^m) in the specific case of cryptographic applications, where the field dimension m may be very large (greater than 400) and either m or the defining irreducible polynomial may vary. For this purpose, many inversion schemes with different basis representations are studied and, most importantly, variants of the EEA and of binary (Stein's) GCD computation implementations are reviewed. A set of common as well as contrasting characteristics of these variants is discussed. As a result, a generalized and optimized variant of the EEA is proposed which can compute division, and multiplicative inversion as its subset, with the divisor in either polynomial or triangular basis representation. Further results regarding Hankel matrix formation for double-basis inversion are provided. The validity of using the same architecture to compute field division with polynomial or triangular basis representation is proved. Next, a scalable unidirectional bit-serial systolic array implementation of this proposed variant of the EEA is presented. Its complexity measures are defined and compared against the best known architectures. It is shown that, under the requirements specified above, the proposed architecture may achieve a higher clock rate than other designs while being more flexible and reliable, with a minimum number of inter-cell interconnects. The main contribution at the system architecture level is the substitution of all counter or adder/subtractor elements with a simpler distributed structure that is free of carry-propagation delays. Further, a novel restoring mechanism for the result sequences of the EEA is proposed, using a double delay element implementation. Finally, using this systolic architecture, a CMD (combined multiplier-divider) datapath is designed and used as the core of a novel systolic elliptic curve processor. This EC processor uses affine coordinates to compute scalar point multiplication, which results in a very small control unit whose size is negligible with respect to the datapath for all practical values of m. The throughput of this EC processor, based on the bit-serial systolic architecture, is comparable with previously reported designs many times larger than itself.
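
    For readers who want the arithmetic (though not the bit-serial systolic data flow) in executable form, the sketch below performs EEA-based inversion in GF(2^m) with field elements packed into integers; the GF(2^8) field and the AES reduction polynomial are illustrative assumptions, not parameters from the thesis.

```python
# Software sketch of multiplicative inversion in GF(2^m) via the extended
# Euclidean algorithm; bit i of an integer holds the coefficient of x^i.

def gf2_mul(a, b, f, m):
    """Carry-less multiply of a and b, reduced modulo the irreducible f."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:      # keep deg(a) < m by subtracting (XOR-ing) f
            a ^= f
    return r

def gf2_inv(a, f, m):
    """Inverse of a nonzero element a in GF(2^m) defined by f."""
    u, v = a, f
    g1, g2 = 1, 0
    while u != 1:
        j = u.bit_length() - v.bit_length()
        if j < 0:
            u, v, g1, g2, j = v, u, g2, g1, -j
        u ^= v << j           # one degree-reduction step of Euclid over GF(2)
        g1 ^= g2 << j
    return g1

# Example in GF(2^8) with the AES polynomial x^8 + x^4 + x^3 + x + 1.
m, f = 8, 0x11B
a = 0x53
inv = gf2_inv(a, f, m)
print(hex(inv), gf2_mul(a, inv, f, m))   # 0xca 1
```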

    Efficient Computation with Sparse and Dense Polynomials

    Computations with polynomials are at the heart of any computer algebra system and also have many applications in engineering, coding theory, and cryptography. Generally speaking, the low-level polynomial computations of interest can be classified as arithmetic operations, algebraic computations, and inverse symbolic problems. New algorithms are presented in all these areas which improve on the state of the art in both theoretical and practical performance. Traditionally, polynomials may be represented in a computer in one of two ways: as a "dense" array of all possible coefficients up to the polynomial's degree, or as a "sparse" list of coefficient-exponent tuples. In the latter case, zero terms are not explicitly written, giving a potentially more compact representation. In the area of arithmetic operations, new algorithms are presented for the multiplication of dense polynomials. These have the same asymptotic time cost as the fastest existing approaches, but reduce the intermediate storage required from linear in the size of the input to a constant amount. Two different algorithms for so-called "adaptive" multiplication are also presented which effectively provide a gradient between existing sparse and dense algorithms, giving a large improvement in many cases while never performing significantly worse than the best existing approaches. Algebraic computations on sparse polynomials are considered as well. The first known polynomial-time algorithm to detect when a sparse polynomial is a perfect power is presented, along with two different approaches to computing the perfect power factorization. Inverse symbolic problems are those for which the challenge is to compute a symbolic mathematical representation of a program or "black box". First, new algorithms are presented which improve the complexity of interpolation for sparse polynomials with coefficients in finite fields or approximate complex numbers. Second, the first polynomial-time algorithm for the more general problem of sparsest-shift interpolation is presented. The practical performance of all these algorithms is demonstrated with implementations in a high-performance library and compared to existing software and previous techniques.
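
    To make the two representations mentioned above concrete, the short sketch below stores a polynomial both densely (a coefficient array) and sparsely (an exponent-to-coefficient map) and multiplies two sparse polynomials term by term; it is a plain illustration, not the adaptive algorithms of the thesis.

```python
# Dense vs. sparse representation, and a naive sparse multiplication.

def dense_to_sparse(coeffs):
    """coeffs[i] is the coefficient of x^i; keep only the nonzero terms."""
    return {e: c for e, c in enumerate(coeffs) if c != 0}

def sparse_mul(f, g):
    """Multiply two sparse polynomials given as {exponent: coefficient}."""
    h = {}
    for ef, cf in f.items():
        for eg, cg in g.items():
            h[ef + eg] = h.get(ef + eg, 0) + cf * cg
    return {e: c for e, c in h.items() if c != 0}

# x^1000 + 3 needs 1001 dense slots but only two sparse terms.
f = {1000: 1, 0: 3}
g = {500: 2, 0: 1}
print(sparse_mul(f, g))                # {1500: 2, 1000: 1, 500: 6, 0: 3}
print(dense_to_sparse([3, 0, 0, 1]))   # x^3 + 3  ->  {0: 3, 3: 1}
```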

    More Efficient Algorithms for the NTRU Key Generation using the Field Norm

    NTRU lattices are a class of polynomial rings which allow for compact and efficient representations of the lattice basis, thereby offering very good performance characteristics for the asymmetric algorithms that use them. Signature algorithms based on NTRU lattices have fast signature generation and verification, and relatively small signatures, public keys and private keys. A few lattice-based cryptographic schemes entail, generally during the key generation, solving the NTRU equation: f G - g F = q mod (x^n + 1). Here f and g are fixed, the goal is to compute solutions F and G to the equation, and all the polynomials are in Z[x]/(x^n + 1). The existing methods for solving this equation are quite cumbersome: their time and space complexities are at least cubic and quadratic in the dimension n, and for typical parameters they therefore require several megabytes of RAM and take more than a second on a typical laptop, precluding onboard key generation in embedded systems such as smart cards. In this work, we present two new algorithms for solving the NTRU equation. Both algorithms make repeated use of the field norm in a tower of fields; this allows them to be faster and more compact than existing algorithms by factors Õ(n). For lattice-based schemes considered in practice, this reduces both the computation time and RAM usage by factors of at least 100, making key pair generation within range of smart card abilities.
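
    The sketch below only verifies a toy solution of the NTRU equation in Z[x]/(x^n + 1); the tiny polynomials and the choice q = 12289 are illustrative assumptions, and the field-norm key-generation algorithms of the paper are not reproduced here.

```python
# Verify f*G - g*F = q in Z[x]/(x^n + 1) for a toy instance.

def mul_mod_phi(a, b, n):
    """Multiply coefficient lists a, b of length n modulo x^n + 1."""
    res = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                res[k] += ai * bj
            else:
                res[k - n] -= ai * bj    # reduce using x^n = -1
    return res

def is_ntru_solution(f, g, F, G, q, n):
    fG = mul_mod_phi(f, G, n)
    gF = mul_mod_phi(g, F, n)
    return [x - y for x, y in zip(fG, gF)] == [q] + [0] * (n - 1)

# Toy instance with n = 2: f = 1 + x, g = x, and constant F = G = q.
n, q = 2, 12289
f, g, F, G = [1, 1], [0, 1], [q, 0], [q, 0]
print(is_ntru_solution(f, g, F, G, q, n))   # True
```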

    Residue Number System Based Building Blocks for Applications in Digital Signal Processing

    This doctoral thesis deals with designing residue number system based building blocks to enhance the performance of digital signal processing applications. The residue number system (RNS) is a non-weighted number system that provides carry-free, parallel, high speed, secure and fault tolerant arithmetic operations. These features make it very attractive to be used in high-performance and fault tolerant digital signal processing (DSP) applications. A typical RNS system consists of three main components; the first one is the binary to residue converter that computes the RNS equivalent of the inputs represented in the binary number system. The second component in this system is parallel residue arithmetic units that perform arithmetic operations on the operands already represented in RNS. The last component is the residue to binary converter, which converts the outputs back into their binary representation. The main aim of this thesis was to propose novel structures of the basic components of this system in order to be later used as fundamental units in DSP applications. This thesis encloses improving and designing novel structures of these components, simulating and verifying their efficiency via FPGA implementation. In addition to suggesting novel structures of basic RNS components, a detailed study on different moduli sets that compares and determines the most efficient one for different dynamic range requirements is also presented.
One of the main outcomes of this thesis is establishing and verifying the main condition that should be met when choosing a moduli set in order to improve the timing performance of a DSP application. An RNS-based image processing application is also proposed; its efficiency, in terms of timing performance and power consumption, is demonstrated by comparison with a binary-based implementation. Finally, the main considerations to take into account when choosing between the binary number system and RNS are discussed in detail.
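
    As a plain software illustration of the three RNS stages described above, the sketch below converts operands to residues, multiplies channel-wise without carries between channels, and reconstructs the result with the Chinese Remainder Theorem; the moduli set {2^k - 1, 2^k, 2^k + 1} is a common textbook choice, not necessarily the set recommended in the thesis.

```python
# Forward conversion, channel-wise arithmetic, and CRT reverse conversion.
from math import prod

def to_rns(x, moduli):
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    """CRT: x = sum(r_i * M_i * (M_i^{-1} mod m_i)) mod M, with M_i = M/m_i."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)
    return x % M

k = 8
moduli = [2**k - 1, 2**k, 2**k + 1]     # pairwise coprime, dynamic range ~ 2^24
a, b = 40000, 12345
ra, rb = to_rns(a, moduli), to_rns(b, moduli)
product = [(x * y) % m for x, y, m in zip(ra, rb, moduli)]   # independent channels
print(from_rns(product, moduli) == (a * b) % prod(moduli))   # True
```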

    A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

    The extended GCD (XGCD) calculation, which computes Bézout coefficients b_a, b_b such that b_a · a_0 + b_b · b_0 = GCD(a_0, b_0), is a critical operation in many cryptographic applications. In particular, large-integer XGCD is computationally dominant for two applications of increasing interest: verifiable delay functions that square binary quadratic forms within a class group and constant-time modular inversion for elliptic curve cryptography. Most prior work has focused on fast software implementations. The few works investigating hardware acceleration build on variants of Euclid's division-based algorithm, following the approach used in optimized software. We show that adopting variants of Stein's subtraction-based algorithm instead leads to significantly faster hardware. We quantify this advantage by performing a large-integer XGCD accelerator design space exploration comparing Euclid- and Stein-based algorithms for various application requirements. This exploration leads us to an XGCD hardware accelerator that is flexible and efficient, supports fast average and constant-time evaluation, and is easily extensible for polynomial GCD. Our 16nm ASIC design calculates 1024-bit XGCD in 294ns (8× faster than the state-of-the-art ASIC) and constant-time 255-bit XGCD for inverses in the field of integers modulo the prime 2^255 − 19 in 85ns (31× faster than state-of-the-art software). We believe our design is the first high-performance ASIC for the XGCD computation that is also capable of constant-time evaluation. Our work is publicly available at https://github.com/kavyasreedhar/sreedhar-xgcd-hardware-ches2022
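
    A plain software sketch of a subtraction-based (Stein-style) extended GCD is given below; it follows the classical binary extended Euclidean algorithm and returns (g, ba, bb) with ba * a0 + bb * b0 = g, but it does not model the constant-time behaviour or the hardware design of the paper. The test values are illustrative.

```python
# Binary (subtraction-based) extended GCD: returns (g, ba, bb) with
# ba*a0 + bb*b0 = g = GCD(a0, b0), for positive a0, b0.

def binary_xgcd(a0, b0):
    x, y, shift = a0, b0, 0
    while x % 2 == 0 and y % 2 == 0:      # pull out common powers of two
        x, y, shift = x // 2, y // 2, shift + 1
    u, v = x, y
    A, B, C, D = 1, 0, 0, 1               # invariants: A*x + B*y = u, C*x + D*y = v
    while u != 0:
        while u % 2 == 0:                 # halve u while keeping A, B integral
            u //= 2
            if A % 2 == 0 and B % 2 == 0:
                A, B = A // 2, B // 2
            else:
                A, B = (A + y) // 2, (B - x) // 2
        while v % 2 == 0:                 # same for v
            v //= 2
            if C % 2 == 0 and D % 2 == 0:
                C, D = C // 2, D // 2
            else:
                C, D = (C + y) // 2, (D - x) // 2
        if u >= v:                        # subtraction instead of division
            u, A, B = u - v, A - C, B - D
        else:
            v, C, D = v - u, C - A, D - B
    return v << shift, C, D

a0, b0 = 2**255 - 19, 1234567
g, ba, bb = binary_xgcd(a0, b0)
print(g, ba * a0 + bb * b0 == g)          # 1 True
```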

    Polynomial GCD using straight line program representation

    SIGLE. Available from British Library Document Supply Centre, DSC:DXN032874 / BLDSC - British Library Document Supply Centre, GB, United Kingdom.

    Modular decomposition techniques for stored-logic digital filters

    Digital filtering is an important signal processing technique whose theory is now well established. At present, however, there are no well-defined and systematic methods available for realising digital filters in hardware. This project aims to develop such methods, which are general and technology-independent, and adopts a systems and sub-systems design philosophy. The realisation problem is approached in a new way, using concepts from finite-automata theory and implementing complete digital filter sections as stored-logic units. Two methods are introduced and developed. [Continues.]