506 research outputs found

    Radix-2^r Arithmetic for Multiplication by a Constant.

    In this paper, radix-2^r arithmetic is explored to minimize the number of additions in multiplication by a constant. We provide a formal proof that, for an N-bit constant, the maximum number of additions using radix-2^r is lower than Dimitrov's estimated upper bound (2N/log(N)) using the double base number system (DBNS). Compared with canonical signed digit (CSD) and DBNS, the new radix-2^r recoding requires on average 23.12% and 3.07% fewer additions for 64-bit constants, respectively.
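As a point of reference for the CSD baseline mentioned in this abstract, the following is a minimal sketch of canonical signed digit recoding and of a shift-and-add multiplication built on it. It does not implement the paper's radix-2^r recoding; the function names `csd_digits`, `multiply_by_constant`, and `addition_count` are hypothetical and chosen only for illustration.

```python
def csd_digits(c):
    """Return the CSD (non-adjacent form) digits of a positive integer c.

    Digits are in {-1, 0, +1}, least significant first, and no two
    adjacent digits are both nonzero.
    """
    digits = []
    while c > 0:
        if c & 1:
            # Choose +1 when c mod 4 == 1 and -1 when c mod 4 == 3,
            # so that the next digit is guaranteed to be zero.
            d = 2 - (c & 3)
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x, c):
    """Compute x * c using only shifts, additions, and subtractions."""
    total = 0
    for shift, d in enumerate(csd_digits(c)):
        if d:
            total += d * (x << shift)
    return total


def addition_count(c):
    """Additions/subtractions needed: one fewer than the nonzero digits."""
    return sum(1 for d in csd_digits(c) if d) - 1
```

For example, `csd_digits(7)` recodes 7 as 8 - 1, so multiplying by 7 costs one subtraction instead of the two additions that the plain binary form 4 + 2 + 1 would need.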

    A new Low-Power recoding algorithm for multiplierless single/multiple constant multiplication.

    Optimizing the number of additions in constant coefficient multiplication is conjectured to be an NP-hard problem. In this paper, we report a new heuristic requiring on average 29.10% and 10.61% fewer additions than the standard canonical signed digit representation (CSD) and the double base number system (DBNS), respectively, for 64-bit coefficients. The maximum number of additions per coefficient is bounded by (N/4)+2, and the time complexity of the recoding is linear in N, where N is the bit size of the constant. This performance is achieved using a new redundant version of radix-2^8 recoding.

    Fast integer multiplication using generalized Fermat primes

    For almost 35 years, Schönhage-Strassen's algorithm has been the fastest algorithm known for multiplying integers, with a time complexity of O(n · log n · log log n) for multiplying n-bit inputs. In 2007, Fürer proved that there exist K > 1 and an algorithm performing this operation in O(n · log n · K^(log* n)). Recent work by Harvey, van der Hoeven, and Lecerf showed that this complexity estimate can be improved to obtain K = 8, and conjecturally K = 4. Using an alternative algorithm, which relies on arithmetic modulo generalized Fermat primes, we obtain conjecturally the same result K = 4 via a careful complexity analysis in the deterministic multitape Turing model.

    Radix-16 signed-digit division

    For use in the context of a linearly scalable arithmetic architecture supporting high/variable precision arithmetic operations (integer or fractional), a two-stage algorithm for fixed-point, radix-16 signed-digit division is presented. The algorithm uses two limited-precision radix-4 quotient digit selection stages to produce the full radix-16 quotient digit. The algorithm requires a two-digit estimate of the (initial) partial remainder and a three-digit estimate of the divisor to correctly select each successive quotient digit. The normalization of redundant signed-digit numbers requires accommodation of some fuzziness at one end of the range of numeric values that are considered normalized. A set of general equations for determining the ranges of normalized signed-digit numbers is derived. Another set of general equations, for determining the precisions of the estimates of the divisor and dividend required in a limited-precision SRT model signed-digit division, is also derived. These two sets of equations permit design tradeoff analyses to be made with respect to the complexity of the model division. The specific case of a two-stage radix-16 signed-digit division is presented. The staged division algorithm used can be extended to other radices as long as the signed-digit number representation used has certain properties.

    Efficient long division via Montgomery multiply

    We present a novel right-to-left long division algorithm based on the Montgomery modular multiply, consisting of separate highly efficient loops with simple carry structure for computing first the remainder (x mod q) and then the quotient floor(x/q). These loops are ideally suited for the case where x occupies many more machine words than the divide modulus q, and are strictly linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic performance test of multiword dividend and single 64-bit-word divisor, exploitation of the inherent data-parallelism of the algorithm effectively mitigates the long latency of hardware integer MUL operations, as a result of which we are able to achieve respective costs for remainder-only and full-DIV (remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel Core 2 implementation of the x86_64 architecture, in single-threaded execution mode. We further describe a simple "bit-doubling modular inversion" scheme, which allows the entire iterative computation of the mod-inverse required by the Montgomery multiply at arbitrarily large precision to be performed with cost less than that of a single Newtonian iteration performed at the full precision of the final result. We also show how the Montgomery-multiply-based powering can be efficiently used in Mersenne and Fermat-number trial factorization via direct computation of a modular inverse power of 2, without any need for explicit radix-mod scalings. Comment: 23 pages; 8 tables v2: Tweak formatting, pagecount -= 2. v3: Fix incorrect powers of R in formulae [7] and [11] v4: Add Eldridge & Walter ref. v5: Clarify relation between Algos A/A',D and Hensel-div; clarify true-quotient mechanics; Add Haswell timings, refs to Agner Fog timings pdf and GMP asm-timings ref-page. v6: Remove stray +bw in MULL line of Algo D listing; add note re byte-LUT for qinv_
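The mod-inverse that a Montgomery multiply needs can be illustrated with the standard Newton (Hensel-style) iteration, in which each step doubles the number of correct low-order bits. The sketch below is a generic version of that idea, not necessarily the exact scheme of the paper: it runs every step at full width for simplicity, whereas the abstract describes performing early steps at reduced precision. The function name `inverse_mod_power_of_two` is hypothetical.

```python
def inverse_mod_power_of_two(q, bits=64):
    """Return inv such that (q * inv) mod 2**bits == 1, for odd q."""
    if q % 2 == 0:
        raise ValueError("q must be odd to be invertible modulo a power of two")
    mask = (1 << bits) - 1
    # For any odd q, q*q == 1 (mod 8), so q itself is an inverse
    # that is already correct in its 3 lowest bits.
    inv = q & mask
    correct_bits = 3
    while correct_bits < bits:
        # Newton step: if q*inv == 1 (mod 2**k), then
        # q * inv*(2 - q*inv) == 1 (mod 2**(2k)).
        inv = (inv * (2 - q * inv)) & mask
        correct_bits *= 2
    return inv
```

Starting from 3 correct bits, five iterations (6, 12, 24, 48, 96 bits) are enough to cover a 64-bit word.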

    Horner's Rule-Based Multiplication over Fp and Fp^n: A Survey

    This paper surveys multipliers based on Horner's rule for finite field arithmetic. We present a generic architecture based on five processing elements and introduce a classification of several algorithms based on our model. We provide readers with a detailed description of each scheme, which should allow them to write a VHDL description or a VHDL code generator.
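To make the underlying idea concrete, here is a minimal software sketch of Horner's rule applied to modular multiplication over Fp: the multiplier is scanned from its most significant bit, and each step doubles the accumulator and conditionally adds the multiplicand, reducing modulo p as it goes. This is a bit-serial illustration only; it is not one of the surveyed hardware architectures, and the name `horner_modmul` is hypothetical.

```python
def horner_modmul(a, b, p):
    """Compute (a * b) mod p by Horner's rule over the bits of b.

    Writing b = sum(b_i * 2**i), the product is evaluated as
    (...((b_k * a) * 2 + b_{k-1} * a) * 2 + ...) + b_0 * a, all mod p.
    Assumes b >= 0 and p > 0.
    """
    acc = 0
    for bit in bin(b)[2:]:          # most significant bit first
        acc = (acc << 1) % p        # multiply the running value by 2
        if bit == "1":
            acc = (acc + a) % p     # add the multiplicand for a set bit
    return acc
```

Because the accumulator is reduced after every step, intermediate values never grow beyond roughly twice the size of p, which is the property hardware multipliers of this family rely on.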

    Composite Iterative Algorithm and Architecture for q-th Root Calculation

    An algorithm for q-th root extraction, where q is a positive integer, is presented in this paper. The algorithm is based on an optimized implementation of X^{1/q} by a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis and two architectures are proposed, for low-precision q and for higher-precision q. The execution time and hardware requirements are estimated for single and double precision floating-point computations for several radices; this helps to determine which radices result in the most efficient implementations. The proposed architectures improve on the features of other architectures for q-th root extraction.
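The four operations listed in this abstract correspond to the identity X^{1/q} = exp((1/q) · ln X). The sketch below shows that decomposition in ordinary floating point using Python's `math` module; it says nothing about the digit-recurrence or on-line hardware algorithms themselves, and the name `qth_root` is hypothetical.

```python
import math


def qth_root(x, q):
    """Approximate x**(1/q) for x > 0 and a positive integer q."""
    if x <= 0 or q <= 0:
        raise ValueError("this sketch assumes x > 0 and q > 0")
    reciprocal = 1.0 / q            # (1) reciprocal of q
    logarithm = math.log(x)         # (2) natural logarithm of x
    product = reciprocal * logarithm  # (3) multiplication
    return math.exp(product)        # (4) exponential
```

In software the four steps run one after another; the abstract's contribution is that the hardware versions of these steps are overlapped.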

    On Polynomial Multiplication in Chebyshev Basis

    In a recent paper, Lima, Panario and Wang provided a new method to multiply polynomials in Chebyshev basis which aims at reducing the total number of multiplications when the polynomials have small degree. Their idea is to use Karatsuba's multiplication scheme to improve upon the naive method, but without getting rid of its quadratic complexity. In this paper, we extend their result by providing a reduction scheme that makes it possible to multiply polynomials in Chebyshev basis using algorithms from the monomial basis case, and therefore to obtain the same asymptotic complexity estimate. Our reduction allows any of these algorithms to be used without converting the input polynomials to monomial basis, and therefore provides a more direct reduction scheme than the one using conversions. We also demonstrate that our reduction is efficient in practice, and that it even outperforms the best known algorithm for Chebyshev basis when the polynomials have large degree. Finally, we demonstrate a linear time equivalence between the polynomial multiplication problem under monomial basis and under Chebyshev basis.
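For context, the naive quadratic method that this abstract takes as its starting point follows from the product identity T_i · T_j = (T_{i+j} + T_{|i-j|}) / 2 for Chebyshev polynomials of the first kind. The sketch below implements only that baseline with exact rational arithmetic; it is not the reduction scheme of the paper, and the names `chebyshev_multiply` and `chebyshev_eval` are hypothetical.

```python
from fractions import Fraction


def chebyshev_multiply(a, b):
    """Multiply two polynomials given by Chebyshev coefficients.

    a[k] and b[k] are the coefficients of T_k. Uses the identity
    T_i * T_j = (T_{i+j} + T_{|i-j|}) / 2, so the cost is quadratic.
    """
    result = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            half = Fraction(ai) * Fraction(bj) / 2
            result[i + j] += half
            result[abs(i - j)] += half
    return result


def chebyshev_eval(coeffs, x):
    """Evaluate sum(coeffs[k] * T_k(x)) with T_{k+1} = 2x*T_k - T_{k-1}."""
    t_prev, t_cur = Fraction(1), Fraction(x)   # T_0(x) and T_1(x)
    total = Fraction(0)
    for c in coeffs:
        total += c * t_prev
        t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
    return total
```

For instance, `chebyshev_multiply([0, 1], [0, 1])` gives coefficients 1/2, 0, 1/2, which is the familiar identity T_1 · T_1 = (T_0 + T_2) / 2.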