506 research outputs found
Radix-2^r Arithmetic for Multiplication by a Constant.
In this paper, radix-2^r arithmetic is explored to minimize the number of additions in multiplication by a constant. We provide a formal proof that, for an N-bit constant, the maximum number of additions using radix-2^r is lower than Dimitrov's estimated upper bound of 2N/log(N) for the double base number system (DBNS). Compared with canonical signed digit (CSD) and DBNS, the new radix-2^r recoding requires on average 23.12% and 3.07% fewer additions, respectively, for a 64-bit constant.
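For orientation, the CSD baseline mentioned in this abstract can be sketched as follows. This is a minimal illustration of standard canonical signed-digit (non-adjacent form) recoding, not the radix-2^r recoding the paper proposes; the function names `csd_digits`, `multiply_by_constant` and `addition_count` are hypothetical.

```python
def csd_digits(c):
    """Return the CSD (non-adjacent form) digits of an integer c >= 0,
    least significant first, each digit in {-1, 0, 1}."""
    digits = []
    while c > 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c % 4 == 1, -1 if c % 4 == 3
            c -= d           # clears the low bit (and may carry upward)
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def multiply_by_constant(x, c):
    """Compute x * c using only shifts, additions and subtractions."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


def addition_count(c):
    """Number of additions/subtractions the CSD form of c needs."""
    nonzero = sum(1 for d in csd_digits(c) if d != 0)
    return max(nonzero - 1, 0)
```

For example, 7 recodes as 8 - 1, so multiplying by 7 costs one subtraction instead of the two additions that the plain binary form 4 + 2 + 1 would need.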
A new Low-Power recoding algorithm for multiplierless single/multiple constant multiplication.
Optimizing the number of additions in constant coefficient multiplication is conjectured to be an NP-hard problem. In this paper, we report a new heuristic requiring on average 29.10% and 10.61% fewer additions than the standard canonical signed digit representation (CSD) and the double base number system (DBNS), respectively, for 64-bit coefficients. The maximum number of additions per coefficient is bounded by (N/4)+2, and the time complexity of the recoding is linear in N, where N is the bit size of the constant. This performance is achieved using a new redundant version of radix-2^8 recoding.
Fast integer multiplication using generalized Fermat primes
For almost 35 years, Schönhage-Strassen's algorithm has been the fastest
algorithm known for multiplying integers, with a time complexity O(n
log n log log n) for multiplying n-bit inputs. In 2007, Fürer
proved that there exists K > 1 and an algorithm performing this operation in
O(n log n K^{log* n}). Recent work by Harvey, van der Hoeven,
and Lecerf showed that this complexity estimate can be improved in order to get
K = 8, and conjecturally K = 4. Using an alternative algorithm, which relies on
arithmetic modulo generalized Fermat primes, we obtain conjecturally the same
result K = 4 via a careful complexity analysis in the deterministic multitape
Turing model.
Radix-16 signed-digit division
For use in the context of a linearly scalable arithmetic architecture supporting high/variable precision arithmetic operations (integer or fractional), a two-stage algorithm for fixed-point, radix-16 signed-digit division is presented. The algorithm uses two limited-precision radix-4 quotient digit selection stages to produce the full radix-16 quotient digit. The algorithm requires a two-digit estimate of the (initial) partial remainder and a three-digit estimate of the divisor to correctly select each successive quotient digit. The normalization of redundant signed-digit numbers requires accommodation of some fuzziness at one end of the range of numeric values that are considered normalized. A set of general equations for determining the ranges of normalized signed-digit numbers is derived. Another set of general equations is derived for determining the precisions of the estimates of the divisor and dividend required in a limited-precision SRT model signed-digit division. These two sets of equations permit design tradeoff analyses to be made with respect to the complexity of the model division. The specific case of a two-stage radix-16 signed-digit division is presented. The staged division algorithm can be extended to other radices as long as the signed-digit number representation used has certain properties.
Efficient long division via Montgomery multiply
We present a novel right-to-left long division algorithm based on the
Montgomery modular multiply, consisting of separate highly efficient loops with
simple carry structure for computing first the remainder (x mod q) and then the
quotient floor(x/q). These loops are ideally suited for the case where x
occupies many more machine words than the divide modulus q, and are strictly
linear time in the "bitsize ratio" lg(x)/lg(q). For the paradigmatic
performance test of multiword dividend and single 64-bit-word divisor,
exploitation of the inherent data-parallelism of the algorithm effectively
mitigates the long latency of hardware integer MUL operations, as a result of
which we are able to achieve respective costs for remainder-only and full-DIV
(remainder and quotient) of 6 and 12.5 cycles per dividend word on the Intel
Core 2 implementation of the x86_64 architecture, in single-threaded execution
mode. We further describe a simple "bit-doubling modular inversion" scheme,
which allows the entire iterative computation of the mod-inverse required by
the Montgomery multiply at arbitrarily large precision to be performed with
cost less than that of a single Newtonian iteration performed at the full
precision of the final result. We also show how the Montgomery-multiply-based
powering can be efficiently used in Mersenne and Fermat-number trial
factorization via direct computation of a modular inverse power of 2, without
any need for explicit radix-mod scalings.
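The idea of computing the Montgomery mod-inverse with doubling precision can be illustrated with the standard Newton-Hensel iteration for the inverse of an odd q modulo a power of two. This is a minimal sketch under that assumption and not necessarily the exact scheme of the paper; the function name `mod_inverse_pow2` and its `bits` parameter are hypothetical.

```python
def mod_inverse_pow2(q, bits=64):
    """Return inv with (q * inv) % 2**bits == 1, for odd q.

    Each Newton-Hensel step inv <- inv * (2 - q * inv) doubles the number
    of correct low-order bits, so early steps can run at low precision.
    """
    if q % 2 == 0:
        raise ValueError("q must be odd to be invertible modulo 2**bits")
    inv = q    # q * q == 1 (mod 8) for odd q, so 3 bits are already correct
    good = 3   # number of low-order bits known to be correct
    while good < bits:
        good = min(2 * good, bits)
        mask = (1 << good) - 1
        # Only the low `good` bits matter at this step, so truncate early.
        inv = (inv * (2 - q * inv)) & mask
    return inv & ((1 << bits) - 1)
```

Because every step works only at the precision it can actually deliver, the total work is dominated by the final full-precision step, which matches the cost claim made in the abstract.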
Horner's Rule-Based Multiplication over Fp and Fp^n: A Survey
This paper surveys multipliers based on Horner's rule for finite field arithmetic. We present a generic architecture based on five processing elements and introduce a classification of several algorithms based on our model. We provide readers with a detailed description of each scheme, which should allow them to write a VHDL description or a VHDL code generator.
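As a software analogue of the idea, Horner's rule applied to modular multiplication over Fp scans the bits of one operand from most to least significant, doubling an accumulator and conditionally adding the other operand. The sketch below is a bit-serial illustration only and does not reproduce the five-element hardware architecture of the survey; the function name `horner_modmul` is hypothetical.

```python
def horner_modmul(a, b, p):
    """Compute (a * b) % p for b >= 0 by Horner's rule on the bits of b.

    Writing b = sum(b_i * 2**i), the product is evaluated as
    (...((b_k * a) * 2 + b_{k-1} * a) * 2 + ...) * 2 + b_0 * a,
    reducing modulo p after every step.
    """
    acc = 0
    for i in range(b.bit_length() - 1, -1, -1):
        acc = (acc << 1) % p        # multiply the partial result by 2
        if (b >> i) & 1:
            acc = (acc + a) % p     # add a when the current bit of b is set
    return acc
```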
Composite Iterative Algorithm and Architecture for q-th Root Calculation
An algorithm for q-th root extraction, where q is any integer, is presented in this paper. The algorithm is based on an optimized implementation of X^{1/q} by a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis is given and two architectures are proposed, one for low-precision q and one for higher-precision q. The execution time and hardware requirements are estimated for single and double precision floating-point computations for several radices; this helps to determine which radices result in the most efficient implementations. The proposed architectures improve on the features of other architectures for q-th root extraction.
On Polynomial Multiplication in Chebyshev Basis
In a recent paper, Lima, Panario and Wang provided a new method to
multiply polynomials in Chebyshev basis which aims at reducing the total number
of multiplications when polynomials have small degree. Their idea is to use
Karatsuba's multiplication scheme to improve upon the naive method, but without
being able to get rid of its quadratic complexity. In this paper, we extend
their result by providing a reduction scheme which allows polynomials in
Chebyshev basis to be multiplied using algorithms from the monomial basis case,
and therefore yields the same asymptotic complexity estimate. Our reduction allows
any of these algorithms to be used without converting the input polynomials to
monomial basis, and therefore provides a more direct reduction scheme than the one
using conversions. We also demonstrate that our reduction is efficient in practice,
and even outperforms the best known algorithm for Chebyshev
basis when polynomials have large degree. Finally, we demonstrate a linear time
equivalence between the polynomial multiplication problem under monomial basis
and under Chebyshev basis.