Integer and Floating-Point Constant Multipliers for FPGAs
Nicolas Brisebarre, Florent De Dinechin, Jean-Michel Muller
To cite this version:
Nicolas Brisebarre, Florent De Dinechin, Jean-Michel Muller. Integer and Floating-Point Constant Multipliers for FPGAs. International Conference on Application-Specific Systems, Architectures and Processors, 2008, Jul 2008, Leuven, Belgium. IEEE, pp.239-244, 2008, <10.1109/ASAP.2008.4580184>. <ensl-00269219>
HAL Id: ensl-00269219
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00269219
Submitted on 2 Apr 2008
Nicolas Brisebarre, Florent de Dinechin, Jean-Michel Muller∗
LIP (CNRS/INRIA/ENS-Lyon/UCBL)
Université de Lyon
{Nicolas.Brisebarre, Florent.de.Dinechin, Jean-Michel.Muller}@ens-lyon.fr
Abstract
Reconfigurable circuits now have a capacity that allows them to be used as floating-point accelerators. They offer massive parallelism, but also the opportunity to design optimised floating-point hardware operators not available in microprocessors. Multiplication by a constant is an important example of such an operator. This article presents an architecture generator for the correctly rounded multiplication of a floating-point number by a constant. This constant can be a floating-point value, but also an arbitrary irrational number. The multiplication of the significands is an instance of the well-studied problem of constant integer multiplication, for which improvements to existing algorithms are also proposed and evaluated.
1 Introduction
FPGAs (for field-programmable gate arrays) are high-density VLSI chips which can be programmed to efficiently emulate arbitrary logic circuits. Where a microprocessor is programmed at the granularity of instructions operating on 32- or 64-bit data words, FPGAs are programmed at the bit and register level. This finer grain comes at a cost: a circuit implemented in an FPGA is typically ten times slower than the same circuit implemented as an ASIC (application-specific integrated circuit). Despite this intrinsic performance gap between FPGAs and ASICs, the former are often used as a replacement for the latter in applications which don't justify the non-recurring costs of an ASIC, or which have to adapt to evolving standards.
FPGAs have also been used as configurable accelerators in computing systems. They typically excel in computations which exhibit massive parallelism and require operations absent from the processor's instruction set.
∗ This work was partly supported by the XtremeData university programme, the ANR EVAFlo project and the Egide Brâncuşi programme 14914RL.
The FloPoCo project aims at exploring the implementation of such non-standard operations, especially in the floating-point realm [4]. This article is a survey of the issue of multiplication by a constant in this context.
State of the art and contributions
Multiplication by a constant is a pervasive operation. It often occurs in scientific computing codes, and is at the core of many signal-processing filters. It is also useful to build larger operators: previously published architectures for exponential, logarithm and trigonometric functions [8, 7] all involve multiplication by a constant. A single unoptimised multiplication by 4/π may account for about one third of the area of a dual sine/cosine operator [7].
The present article essentially reconciles two research directions that have so far been treated separately: on the one hand, the optimisation of multiplication by an integer constant, addressed in Section 2, and on the other hand, the issue of correct rounding of multiplication or division by an arbitrary-precision constant, addressed in Section 4.
Integer constant multiplication has been well studied, with many good heuristics published [3, 6, 13, 5, 1, 15]. Its theoretical complexity is still an open question: it was only recently proven sub-linear, although using an approach which is useless in practice [9, 15]. Our contribution in this domain is essentially a refinement of the objective function: where all previous works, to our knowledge, try to minimise the number of additions, we remark that these additions, measured in terms of full-adder cells, have different sizes (up to a factor 4 for the large multiplier by 4/π of [7]), hence a variable cost in reconfigurable logic. Trying to minimise the number of full adders, and looking for low-latency architectures that are easy to pipeline, we suggest a surprisingly simple algorithm that, for constants of up to 64 bits, outperforms the best known algorithms in terms of FPGA area usage and latency. Boullis and Tisserand [1] also tried to minimise adder size, but as a post-processing step, after an algorithm minimising the number of additions.
Section 3 describes a multiplier by a floating-point constant of arbitrary size. The architecture is a straightforward specialisation of the usual floating-point multiplier. It is actually slightly simpler, because the normalisation of the result can be removed from the critical path.
Finally, Section 4 deals with the correct rounding of the multiplication by an arbitrary real constant. Previous work on the subject [2] has shown that this correct rounding requires a floating-point approximation of the constant whose typical size is twice the mantissa size of the input. This size actually depends on the real constant, and may be computed using a simple continued fractions algorithm. The other contribution of [2] is the proof of an algorithm which consists of two dependent fused multiply-and-add operations. On an FPGA, the implementation will be much simpler, since it suffices to instantiate a large enough FP constant multiplier. Of course, a multiplier by an arbitrary constant is also capable of computing the division by an arbitrary constant [14].
In other words, for a multiplication by an irrational constant like π or log 2, the price of correct rounding will typically be a doubling of the number of bits of the constant that have to be used in the significand multiplication. As the cost of such a multiplication is sub-linear in the constant size [9], the price of correct rounding is actually less than this factor 2 in practice.
All these architectures are implemented in the FloPoCo framework.
2 Multiplication by an integer constant
Several recent papers [1, 9, 15] provide the interested reader with the state of the art on this subject.
Let C be a positive integer constant, written in binary on k bits:
$$C = \sum_{i=0}^{k-1} c_i 2^i \quad \text{with } c_i \in \{0, 1\}.$$
Let X be a p-bit integer. The product is written $CX = \sum_{i=0}^{k-1} 2^i c_i X$, and by only considering the non-zero $c_i$, it is expressed as a sum of terms $2^i X$. For instance, $17X = X + 2^4 X$. In the following, we will note this using the shift operator <<, which has higher priority than + and −. For instance, 17X = X + X<<4.
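The decomposition above can be illustrated with a short Python sketch. The helper name `mul_by_const_binary` is our own (it is not part of FloPoCo), and it only models the arithmetic, not the hardware: one shifted copy X<<i is formed for every non-zero binary digit $c_i$, so n non-zero digits cost n − 1 additions.

```python
def mul_by_const_binary(c, x):
    """Multiply x by the positive integer constant c with shifts and additions only.

    Returns (product, additions): one shifted copy x << i is formed for every
    binary digit c_i = 1, and n such copies are summed with n - 1 additions.
    """
    if c <= 0:
        raise ValueError("c must be a positive integer constant")
    terms = [x << i for i in range(c.bit_length()) if (c >> i) & 1]
    return sum(terms), len(terms) - 1


print(mul_by_const_binary(17, 5))   # (85, 1): 17X = X + X<<4 with X = 5
```

For 221 = 11011101 in binary, the same function needs 5 additions, which is what the signed-digit recoding below improves on.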
If we allow the digits of the constant to be negative ($c_i \in \{-1, 0, 1\}$), we obtain a redundant representation: for instance, $15 = 01111 = 1000\bar{1}$ (16 − 1 written in signed binary). Among the representations of a given constant C, we pick one that minimises the number of non-zero digits, hence of additions/subtractions.
The well-known canonical signed digit recoding (or CSD, also called Booth recoding [10]) guarantees that at most ⌈(k+1)/2⌉ digits (roughly k/2) are non-zero, and about k/3 on average.
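A minimal Python sketch of CSD recoding follows, assuming a non-negative integer constant; `csd` and `csd_string` are illustrative names, not FloPoCo functions. It scans the constant from the least significant bit and, whenever the remaining value is odd, picks the digit +1 or −1 that leaves a multiple of 4, which forces a zero digit between any two non-zero digits.

```python
def csd(c):
    """Canonical signed digit recoding of an integer c >= 0.

    Returns the digits in {-1, 0, 1}, least significant first, such that
    c == sum(d * 2**i) and no two consecutive digits are both non-zero.
    """
    if c < 0:
        raise ValueError("c must be non-negative")
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)   # c = 1 mod 4 -> digit +1, c = 3 mod 4 -> digit -1
            c -= d            # c is now a multiple of 4: the next digit will be 0
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def csd_string(c):
    """Most significant digit first, with '-' standing for a -1 digit (the overbar)."""
    return "".join({1: "1", 0: "0", -1: "-"}[d] for d in reversed(csd(c)))


print(csd_string(221))   # 100-00-01, i.e. 221 = 256 - 32 - 4 + 1
```

For 221 the recoding has 4 non-zero digits instead of the 6 ones of its plain binary form.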
2.1 Parenthesising and architectures
The CSD recoding of a constant may be translated into a rectangular architecture [5], an example of which is given in Figure 1. This architecture corresponds to the following parenthesising: 221X = X<<8 + (−X<<5 + (−X<<2 + X)).
[Figure 1 (diagram): a rectangular array of adders producing the product bits p0 to p15 from the input bits x0 to x7; one row holds −X[0..7] with the sign extension of −X.]
Figure 1: Multiplier of an 8-bit input by 221, using the recoding $100\bar{1}00\bar{1}01$
We introduce a new tree adder structure that is constructed out of the CSD recoding of the constant as follows: non-zero digits are first grouped by 2, then by 4, etc. For instance, 221X = (X<<8 − X<<5) + (−X<<2 + X). This new way of parenthesising the sum reduces the critical path: for k non-zero digits, it now consists of ⌈log2 k⌉ additions instead of k − 1 in the linear architecture of Figure 1.
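The two parenthesisations can be compared with the Python sketch below, which assumes a constant C ≥ 1 and uses hypothetical helper names. It takes the non-zero CSD terms from least to most significant and adds neighbours pairwise, level by level, so the number of adder levels is ⌈log2 k⌉ for k non-zero digits; for 221 this reproduces the grouping (X<<8 − X<<5) + (−X<<2 + X).

```python
import math


def csd(c):
    """CSD digits of c >= 0, least significant first."""
    digits = []
    while c:
        d = (2 - (c & 3)) if c & 1 else 0
        c -= d
        digits.append(d)
        c >>= 1
    return digits


def tree_sum(terms):
    """Add terms pairwise, level by level; return (sum, number of adder levels)."""
    depth = 0
    while len(terms) > 1:
        terms = [sum(terms[i:i + 2]) for i in range(0, len(terms), 2)]
        depth += 1
    return terms[0], depth


def mul_tree(c, x):
    """c*x from the signed CSD terms of c, grouped by 2, then by 4, and so on."""
    terms = [d * (x << i) for i, d in enumerate(csd(c)) if d != 0]
    return tree_sum(terms)


print(mul_tree(221, 7))   # (1547, 2): 4 non-zero digits, ceil(log2 4) = 2 levels
```

The linear architecture of Figure 1 would instead chain the same four terms through three dependent additions.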
Besides, shifts may also be reparenthesised: 221X = (X<<3 − X)<<5 + (−X<<2 + X). After doing this, the leaves of the tree are multiplications by small constants: 3X, 5X, 7X, 9X... Such a small constant may appear many times in a larger constant, but it has to be computed only once: the tree then becomes a DAG (directed acyclic graph), and the number of additions is reduced. A larger example is shown in Figure 2.
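The effect of reparenthesising the shifts can be sketched as follows, under the simplifying assumption that only the first level of the tree (pairs of non-zero CSD digits) is rewritten; the function names are hypothetical, and the grouping of Figure 2 may differ. Each pair $d_j 2^j + d_i 2^i$ with i < j becomes an odd constant shifted left by i, and since CSD digits are never adjacent, these odd constants are ±1, ±3, ±5, ±7, ±9, and so on.

```python
def csd(c):
    """CSD digits of c >= 0, least significant first."""
    digits = []
    while c:
        d = (2 - (c & 3)) if c & 1 else 0
        c -= d
        digits.append(d)
        c >>= 1
    return digits


def leaf_constants(c):
    """Pair consecutive non-zero CSD digits of c and rewrite each pair
    d_j*2^j + d_i*2^i (i < j) as odd << i, with odd = d_j*2^(j-i) + d_i.

    Returns the (odd, shift) leaves; c == sum(odd * 2**shift).
    """
    nz = [(i, d) for i, d in enumerate(csd(c)) if d != 0]
    leaves = []
    for k in range(0, len(nz), 2):
        pair = nz[k:k + 2]
        shift = pair[0][0]                            # position of the lower digit
        odd = sum(d * 2 ** (i - shift) for i, d in pair)
        leaves.append((odd, shift))
    return leaves


def distinct_fundamentals(c):
    """Odd leaf constants other than +-1; in a DAG each is computed only once."""
    return sorted({odd for odd, _ in leaf_constants(c) if odd not in (1, -1)})


print(leaf_constants(221))   # [(-3, 0), (7, 5)]: 221X = (7X)<<5 + (-3X)
big = 1768559438007110
print(len(leaf_constants(big)), len(distinct_fundamentals(big)))
```

Each distinct odd constant needs one adder however many times it occurs, which is where the tree turns into a DAG.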
2.2 Lefèvre's constant multipliers
We have saved adders by going from a tree to a DAG. Lefèvre [13] has generalised this idea to an algorithm that minimises the number of adders: it looks for maximal repeating bit patterns in the CSD representation, and generates them recursively. Lefèvre observed that the number of additions, on randomly generated constants of k bits, grows as $O(k^{0.85})$.

[Figure 2 (diagram): a row of CSD digits of the constant feeds a DAG whose nodes include the small multiples 3X, 5X, 9X, 17X, 127X and −3X, the intermediate multiples −43X, 163X, 1859X, 2181X, 558499X, 4751061X and 39854788871587X, and 884279719003555X, which is shifted left by 1 to give 1768559438007110X.]
Figure 2: DAG architecture for a multiplication by 1768559438007110 (the first 50 bits of the mantissa of π).

Here is an example of the sequence produced for the same constant 1768559438007110. This example was obtained with the program rigo.c, written by Raphaël Rigo and Vincent Lefèvre and available from Vincent Lefèvre's web page¹:
    u0 = x
    u3 = u0<<19 + u0
    u3 = u3<<20
    u3 = u3<<4 + u3
    u7 = u0<<14 - u0
    u6 = u7<<6 + u0
    u5 = u6<<10 + u0
    u1 = u5<<16
    u1 = u1 + u3
    u7 = u0<<21 - u0
    u6 = u7<<18 + u0
    u5 = u6<<4 - u0
    u2 = u5<<5 + u0
    u2 = u2<<1
    u2 = u2<<2 - u2
    u1 = u1 + u2
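As a sanity check, the listing above can be transcribed into Python (our transcription, not code taken from rigo.c). In Python, << binds more loosely than +, contrary to the convention adopted earlier in this section, so every shift is parenthesised. Evaluated at X = 1, the chain returns the constant; it uses 12 additions or subtractions (shifts being free), which is the number of additions reported for Lefèvre/Rigo in Table 1 for the 50-bit constant.

```python
def rigo_sequence(x):
    """Evaluate the rigo.c addition chain for the constant 1768559438007110."""
    u0 = x
    u3 = (u0 << 19) + u0
    u3 = u3 << 20
    u3 = (u3 << 4) + u3
    u7 = (u0 << 14) - u0
    u6 = (u7 << 6) + u0
    u5 = (u6 << 10) + u0
    u1 = u5 << 16
    u1 = u1 + u3
    u7 = (u0 << 21) - u0
    u6 = (u7 << 18) + u0
    u5 = (u6 << 4) - u0
    u2 = (u5 << 5) + u0
    u2 = u2 << 1
    u2 = (u2 << 2) - u2
    u1 = u1 + u2
    return u1          # 12 additions/subtractions in total


C = 1768559438007110
print(rigo_sequence(1) == C, rigo_sequence(12345) == 12345 * C)   # True True
```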
This code translates to a much more compact DAG than the one presented in Figure 2, because it looks for patterns in the full constant instead of just exploiting them when they appear accidentally (and probably only at the leaves).
Still, this code isn't targeted to our context and may actually produce suboptimal results, as the synthesis results in Table 1 show. On the one hand, it doesn't try to balance the DAG to minimise the latency. On the other hand, it only minimises the number of additions, not their actual hardware cost, which depends on their size. Let us now formalise this last issue.
2.3 DAG definition and cost analysis
Each intermediate variable in a DAG holds the result of the multiplication of X by an intermediate constant. To make things clearer, let us name an intermediate variable after this constant: for instance, $V_{255}$ holds 255X, and $V_1 = X$.
¹ http://www.vinc17.org/research/mulbyconst/
In the Rigo/Lefèvre code, each of these intermediate variables is positive, and subtraction is allowed. Cost analysis is slightly simpler if we allow negative intermediate constants, but no subtraction. We then need unary negation to build negative constants. To minimise the use of negation, which has the same cost as addition on an FPGA, one may always transform a DAG into one with only one negation, computing $V_{-1} = -X$.
To sum up, a DAG is built out of the following primitives:
Shift: $V_z \leftarrow V_i \ll s$ ($z = 2^s i$),
Neg: $V_z \leftarrow -V_i$ ($z = -i$),
ShiftAdd: $V_z \leftarrow V_i \ll s + V_j$ ($z = 2^s i + j$).
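The bookkeeping behind these primitives can be sketched in Python by tracking only the constant z attached to each variable. `Var`, `shift`, `neg` and `shift_add` are names invented for this illustration. The example rebuilds 221X as (X<<3 − X)<<5 + (−X<<2 + X) with a single negation producing $V_{-1}$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A DAG variable V_z, which holds z*X; it is named after its constant z."""
    z: int


def shift(vi, s):            # V_z <- V_i << s,        z = 2^s * i
    return Var(vi.z << s)


def neg(vi):                 # V_z <- -V_i,            z = -i
    return Var(-vi.z)


def shift_add(vi, s, vj):    # V_z <- V_i << s + V_j,  z = 2^s * i + j
    return Var((vi.z << s) + vj.z)


# 221X = (X<<3 - X)<<5 + (-X<<2 + X), with a single negation V_-1 = -X
v1 = Var(1)
vm1 = neg(v1)                    # V_-1
v7 = shift_add(v1, 3, vm1)       # V_7   = V_1 << 3 + V_-1
vm3 = shift_add(vm1, 2, v1)      # V_-3  = V_-1 << 2 + V_1
v221 = shift_add(v7, 5, vm3)     # V_221 = V_7 << 5 + V_-3
print(v221)                      # Var(z=221)
```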
Each variable is assigned only once, and it is possible to associate with it
• the maximal size in bits of the result it holds, $|V_z|$,
• the cost, in terms of full adders, of this computation, noted $\mathrm{cost}(V_z)$.
Thus an optimisation algorithm will maintain a list of the already computed variables, indexed by the constants.
The size $|V_z|$ is more or less the sum of the size of z and the size of X. If z > 0, then $|V_z| = |X| + \lceil \log_2 z \rceil$ (equivalently $|X| + \lfloor \log_2(z-1) \rfloor + 1$ for z ≥ 2, where the −1 accounts for powers of 2). If z < 0, then $|V_z| = 1 + |X| + \lceil \log_2(-z) \rceil$: one has to budget an additional sign bit for sign extension. This bit will actually be useful only when X = 0, since the product of 0 by a negative constant is zero, hence non-negative. This detail is worth mentioning as it illustrates the asymmetry between negative constants and positive ones.
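The size rule can be checked against brute force with the sketch below, which assumes X is an unsigned integer of |X| = wx bits and stores negative multiples in two's complement; `var_size` and `exact_size` are hypothetical names. It uses the identity (z − 1).bit_length() = ⌈log2 z⌉ for z ≥ 1 to avoid floating-point logarithms. The formula is exact as long as |z| ≤ 2^|X| (and |X| ≥ 2 when z < 0), and a safe upper bound otherwise.

```python
def var_size(z, wx):
    """Bits budgeted for V_z = z*X, X an unsigned wx-bit integer.

    z > 0: wx + ceil(log2 z), unsigned.
    z < 0: one extra sign bit, two's complement.
    """
    if z == 0:
        raise ValueError("z must be non-zero")
    if z > 0:
        return wx + (z - 1).bit_length()
    return 1 + wx + (-z - 1).bit_length()


def exact_size(z, wx):
    """Brute force: bits actually needed over all X in [0, 2^wx - 1]."""
    extreme = z * ((1 << wx) - 1)              # product of largest magnitude
    if z > 0:
        return extreme.bit_length()
    return (-extreme - 1).bit_length() + 1     # two's complement of a negative value


print(var_size(221, 8), var_size(-3, 8))       # 16 11
```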
Computing the costs is easy once the $|V_z|$ have been computed:
• The cost of $V_z \leftarrow V_i \ll s$ is zero (wiring only).
• The cost of $V_z \leftarrow -V_i$ is $|V_z|$. Again, it is probably best to use this primitive only to compute $V_{-1} = -X$.
• The cost of $V_z \leftarrow V_i \ll s + V_j$ is only $|V_z| - s$: the lower bits of the result are those of $V_j$. There is one exception: if $V_i \ll s$ and $V_j$ do not overlap, i.e. if $|V_j| \le s$, then the addition is free if j is positive: the higher bits are those of $V_i$ and the lower bits those of $V_j$. If j is negative, one needs to sign-extend $V_j$, and the cost is again $|V_z| - s$. This situation may only happen if the size of the constant is at least twice that of X, which, strange as it may seem, happens in several applications: the high-precision polynomial evaluation that motivated [13] is an example, and the trigonometric argument reduction of [7] is another one.
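These three rules translate directly into a cost function. The sketch below is a transcription under the same assumptions as the size rule (unsigned X of wx bits, two's complement for negative multiples), with hypothetical function names; it does not model degenerate cases where cancellation makes $|V_z|$ smaller than s. Applied to a single-negation DAG for 221X with an 8-bit X, it gives 37 full adders.

```python
def var_size(z, wx):
    """Bits of V_z = z*X for an unsigned wx-bit X (extra sign bit when z < 0)."""
    if z > 0:
        return wx + (z - 1).bit_length()
    return 1 + wx + (-z - 1).bit_length()


def cost_shift():
    return 0                                   # wiring only


def cost_neg(z, wx):
    return var_size(z, wx)                     # one full adder per result bit


def cost_shift_add(i, s, j, wx):
    """Full adders for V_z <- V_i << s + V_j, with z = 2^s * i + j."""
    z = i * 2 ** s + j
    if j > 0 and var_size(j, wx) <= s:
        return 0                               # no overlap: plain concatenation
    return var_size(z, wx) - s                 # the s low bits are copied from V_j


wx = 8
total = (cost_neg(-1, wx)                      # V_-1  = -X
         + cost_shift_add(1, 3, -1, wx)        # V_7   = V_1 << 3 + V_-1
         + cost_shift_add(-1, 2, 1, wx)        # V_-3  = V_-1 << 2 + V_1
         + cost_shift_add(7, 5, -3, wx))       # V_221 = V_7 << 5 + V_-3
print(total)   # 37 = 9 + 8 + 9 + 11
```

Note that X<<8 + X with an 8-bit X is free (a concatenation), whereas X<<8 − X, built as V_1<<8 + V_{−1}, costs 8 full adders.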
Of course, architecture generation will produce hardware only for the useful parts of the adders (see also Figure 1). Space does not permit showing the VHDL produced by FloPoCo here, but readers are invited to try it out.
This cost function describes the cost of a combinatorial constant multiplier relatively accurately. It has to be extended to the case of pipelined multipliers: one has to add the overhead of the registers, essentially for the lower bits, since a registered adder has the same cost as a combinatorial one in FPGAs. In principle, one pipeline stage may contain several DAG levels, at least for the lower levels.
2.4 The IntConstMult class in FloPoCo
FloPoCo currently implements this DAG structure and outputs VHDL for it. It also implements cost analysis, but no design space exploration based on it: it currently only builds the simple DAGs illustrated by Figure 1 and Figure 2.
For comparison, two sequences produced by rigo.c, for the significands of π/2 × 2^50 and π/2 × 2^107, were hand-translated into FloPoCo DAGs. From the synthesis results of Table 1 (these are FP multipliers, but their area and delay are largely dominated by the significand multiplication), one observes that for the 50-bit constant, although the number of additions is smaller, the final area is larger, as many of these additions are very large. This justifies the introduction of a new cost function. More work is needed to actually use it in an optimisation program. A fair comparison would also require applying to the DAG given by rigo.c the post-optimisations suggested by [1].
3 Multiplication by a floating-point constant
For the needs of this article, an FP number is written $(-1)^s \cdot 2^E \cdot 1{,}F$, where $1{,}F \in [1, 2[$ is a significand and E is a signed exponent. We note $w_E$ and $w_F$ the respective sizes of E and F, and $\mathbb{F}(w_E, w_F)$ the set of FP numbers in the format defined by $(w_E, w_F)$. We want to allow for different values of $w_E$ and $w_F$ for the input X and the output R:
$$X = (-1)^{s_X} \cdot 2^{E_X} \cdot 1{,}F_X \in \mathbb{F}(w_{E_X}, w_{F_X})$$
$$R = (-1)^{s_R} \cdot 2^{E_R} \cdot 1{,}F_R \in \mathbb{F}(w_{E_R}, w_{F_R})$$
In all the following, the real value of the constant will be noted C, possibly an irrational number, and we define
$$C = (-1)^{s_C} \cdot 2^{E_C} \cdot 1{,}F_C,$$
the unique floating-point² representation of C such that $1{,}F_C \in [1, 2[$. Here $F_C$ may have an infinite binary representation. We note $C_k$ the approximation of C rounded to the nearest on $w_{F_C} = k$ fraction bits:
$$C_k = (-1)^{s_C} \cdot 2^{E_C} \cdot 1{,}F_{C_k}.$$
Finally, we also define the real number
$$1{,}F_{cut} = \frac{2}{1{,}F_C} \in\ ]1, 2],$$
which equals 2 only when $F_C = 0$, i.e. when C is a power of 2.
We now describe a multiplier that computes the correct rounding $R_k$ of $C_k \times X$. Then, Section 4 will compute the minimal k ensuring that, for all $X \in \mathbb{F}(w_{E_X}, w_{F_X})$, $R_k$ is the correct rounding of $C \times X$.
Of course, if C is already an FP number with a p-bit significand, then k = p.
The architecture given by Figure 3, and implemented as the FPConstMult class in FloPoCo, is essentially a simplification of the standard FP multiplier. The main modification is that rounding is simpler. In the standard multiplier, the product of two significands, each in [1, 2), belongs to [1, 4). Its normalisation and rounding are decided by looking at the product. In a constant multiplier, it is possible to predict whether the result will be larger or smaller than 2 just by comparing $F_X$ with $F_{cut}$ (in practice, with $F_{cut}$ truncated to $w_{F_X}$ bits). This is also slightly faster, as the rounding decision is moved off the critical path.
Exponent computation consists in adding the constant exponent, possibly augmented by 1 if $F_X > F_{cut}$. Sign computation is also straightforward. Exceptional case handling is also slightly simpler. For instance, if the constant has a negative exponent, one knows that an overflow will never occur. Likewise, if it is positive or zero, underflow (flush to zero) cannot happen.
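The normalisation and exponent logic can be modelled in Python with exact rationals. In this sketch, the names are hypothetical, `sig_c` is the constant's significand 1,F_C given as a `Fraction`, and a rounded π/2 stands in for an irrational constant. F_cut is truncated to $w_{F_X}$ bits and compared with the integer fraction field F_X. Because F_X is an integer, $F_X > \lfloor F_{cut} 2^{w_{F_X}} \rfloor$ holds exactly when the exact significand product exceeds 2; the sketch ignores the extra carry that rounding may produce.

```python
import math
from fractions import Fraction


def fcut_field(sig_c, wfx):
    """Integer F_cut field truncated to wfx bits, where 1,F_cut = 2 / sig_c.

    sig_c is the constant's significand 1,F_C, a Fraction in [1, 2).
    """
    return math.floor((2 / sig_c - 1) * 2 ** wfx)


def sign_exponent(sx, ex, fx, sc, ec, fcut_t):
    """Sign, exponent and normalisation flag of X*C, known before the product.

    fx is the integer fraction field F_X; the exact product 1,F_X * 1,F_C
    exceeds 2 exactly when F_X > truncated F_cut.
    """
    above = fx > fcut_t
    return sx ^ sc, ex + ec + (1 if above else 0), above


wfx = 8
sig_c = Fraction(314159265358979, 2 * 10**14)   # ~ pi/2, a rational stand-in for C
t = fcut_field(sig_c, wfx)
print(t, sign_exponent(0, 3, 200, 1, 0, t))     # 69 (1, 4, True)
```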
4 Correct rounding of the multiplication by a real constant
This section proposes a method for computing the minimal value of $k = w_{F_C}$ allowing for correct rounding (noted ◦) of the product of any input X by C. First, for a given k, we show how to build a predicate telling whether there exist values of X such that $R_k = \circ(C_k X) \neq \circ(CX)$. This allows us to look for the minimal k for which no such X exists, knowing that it is generally expected to be close to $2w_{F_X}$.
² As the exponent is constant, the point doesn't actually float at all.
                                  FloPoCo linear CSD    FloPoCo DAG CSD     Lefèvre/Rigo
constant                          +    LUTs   delay    +    LUTs   delay   +    LUTs   delay
X on 24 bits, π/2 on 50 bits      19   435    30 ns    15   467    14 ns   12   645    16 ns
X on 53 bits, π/2 on 107 bits     38   2018   68 ns    26   1628   21 ns   22   1508   18 ns
Table 1: Synthesis results for floating-point multipliers (+: number of additions)
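[Figure 3 (diagram): the exception bits exn, the exponent E_X and the fraction F_X enter the operator; F_X is multiplied by the constant significand 1,F_C (a product of w_FC + w_FX + 2 bits); a comparator tests F_X > F_cut; the exponent path adds E_C and possibly 1; the product goes through a right shift and rounding to w_F bits; the exception logic (overflow ov, flush to zero ftz) produces the result fields E_R and F_R.]
Figure 3: Multiplier by an FP constant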
4.1 Existence of an X such that $\circ(C_k X) \neq \circ(CX)$
The FP multiplier guarantees the correct rounding of the result of the multiplication by $C_k$, that is to say,
$$\forall X \in \mathbb{F}(w_{E_X}, w_{F_X}),\quad |R_k - C_k X| \le \tfrac{1}{2}\,\mathrm{ulp}(C_k X) \le \tfrac{1}{2}\,\mathrm{ulp}(R_k),$$
in which ulp(t) (unit in the last place) is the weight of the least significant bit of t.
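Moreover, $C_k$ is the value of C rounded to nearest. Let $\varepsilon_a = |C_k - C|$. By the triangle inequality, $|R_k - CX| \le |R_k - C_k X| + |X|\,|C_k - C|$, hence
$$\forall X \in \mathbb{F}(w_{E_X}, w_{F_X}),\quad |R_k - CX| \le \tfrac{1}{2}\,\mathrm{ulp}(R_k) + |X|\,\varepsilon_a.$$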
We may assume, without any loss of generality, that X and C belong to [1, 2), i.e. $E_X = E_C = 0$. Then we have
$$\forall X \in \mathbb{F}(w_{E_X}, w_{F_X}),\ 1 \le X < 2:\quad |R_k - CX| < \tfrac{1}{2}\,\mathrm{ulp}(R_k) + 2\varepsilon_a. \qquad (1)$$
If we can prove that for all X, $|R_k - CX| \le \frac{1}{2}\,\mathrm{ulp}(R_k)$, then $R_k$ will always be the closest FP number to CX, which is the required property. As shown in Figure 4, if CX satisfies (1) and is at a distance greater than $\frac{1}{2}\,\mathrm{ulp}(R_k)$ from $R_k$, it is necessarily at a distance less than $2\varepsilon_a$ from the middle of two consecutive FP numbers. Such a point is a rational number of the form $(2A+1)/q$, with $2^{w_{F_R}} \le A \le 2^{w_{F_R}+1} - 1$ and $q = 2^{w_{F_R}+t}$, where t is equal to 1 if CX has the same exponent as X (if $F_X \le F_{cut}$), and is equal to 0 otherwise.

[Figure 4 (diagram): the FP numbers around R_k; if CX lies within ½ ulp(R_k) of R_k, then ◦(CX) = R_k; otherwise, CX can only be found in a band of width 2 × ε_a next to the midpoint between two consecutive FP numbers.]
Figure 4: If CX is at a distance greater than 1/2 ulp of $R_k$, then it is at a distance less than $2\varepsilon_a$ from the middle of two consecutive FP numbers.

Therefore, to determine whether an input X is such that $R_k$ is not the correct rounding of CX, one can first check whether there exists an approximation to CX by a rational number of the form $(2A+1)/q$ such that $|CX - (2A+1)/q| \le 2\varepsilon_a$.
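The geometric argument of Figure 4 can be checked exhaustively on a toy format. The sketch below is a brute-force search, not the continued-fraction method of [2]: it takes a 36-digit rational approximation of π/2 as a stand-in for C, rounds it to k fraction bits, and lists the inputs X with $w_{F_X} = 6$ for which $\circ(C_k X) \neq \circ(CX)$ with $w_{F_R} = 6$. For each such X it also reports the distance from CX to the nearest midpoint, which the reasoning above bounds by $X\varepsilon_a < 2\varepsilon_a$. All names are our own.

```python
import math
from fractions import Fraction

# pi/2 to 36 significant digits: a rational stand-in for the real constant C.
C = Fraction("1.57079632679489661923132169163975144")


def rn(v, wf):
    """Round v in [1, 4) to nearest (ties to even) with wf fraction bits."""
    ulp = Fraction(1, 2 ** wf) * (1 if v < 2 else 2)
    return round(v / ulp) * ulp


def dist_to_midpoint(v, wf):
    """Distance from v to the nearest midpoint between consecutive FP numbers:
    (2A+1)/2^(wf+1) inside ]1, 2[, or (2A+1)/2^wf inside ]2, 4[."""
    dists = []
    for q, lo, hi in ((2 ** (wf + 1), 1, 2), (2 ** wf, 2, 4)):
        odd = 2 * math.floor((v * q - 1) / 2) + 1     # largest odd integer <= v*q
        for n in (odd, odd + 2):
            if lo < Fraction(n, q) < hi:
                dists.append(abs(v - Fraction(n, q)))
    return min(dists)


def counterexamples(k, wfx, wfr):
    """All X in [1, 2) with wfx fraction bits such that RN(C_k X) != RN(C X).

    Returns (X, distance from CX to the nearest midpoint, X * eps_a) triples.
    """
    ck = Fraction(round(C * 2 ** k), 2 ** k)           # C rounded to k fraction bits
    eps_a = abs(ck - C)
    found = []
    for m in range(2 ** wfx):
        x = 1 + Fraction(m, 2 ** wfx)
        if rn(ck * x, wfr) != rn(C * x, wfr):
            found.append((x, dist_to_midpoint(C * x, wfr), x * eps_a))
    return found


for k in (4, 8, 12, 16):
    bad = counterexamples(k, 6, 6)
    print(k, len(bad), all(d <= bound for _, d, bound in bad))
```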
The mathematical tool for solving this kind of rational approximation problem is continued fractions [11]. Using them, one can design several methods [2] that make it possible either to guarantee that CX will never be at a distance less than $2\varepsilon_a$ from the middle of two consecutive FP numbers (hence to guarantee that the correct rounding of CX is always returned), or to compute all counter-examples, that is to say values of X such that $C_k X$ rounded to nearest is not the correct rounding of CX. In the latter case, one can derive from each counter-example the value by which k should be incremented in order to get correct rounding.
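The continued fraction expansion and its convergents can be computed exactly with Python's `fractions.Fraction`, as in the minimal sketch below (helper names are ours). Convergents $p_n/q_n$ are the best rational approximations of a number, which is what makes them suited to finding the inputs for which CX comes close to a midpoint. The example uses the double-precision value of π, whose first partial quotients 3, 7, 15, 1, 292 agree with those of π.

```python
import math
from fractions import Fraction


def continued_fraction(alpha, max_terms=20):
    """Partial quotients [a0; a1, a2, ...] of a positive rational alpha."""
    terms = []
    x = Fraction(alpha)
    while len(terms) < max_terms:
        a = math.floor(x)
        terms.append(a)
        if x == a:
            break
        x = 1 / (x - a)
    return terms


def convergents(terms):
    """Convergents p_n/q_n, from p_n = a_n p_(n-1) + p_(n-2) (same for q_n)."""
    p_prev, q_prev, p, q = 1, 0, terms[0], 1
    result = [Fraction(p, q)]
    for a in terms[1:]:
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        result.append(Fraction(p, q))
    return result


pi = Fraction(math.pi)                   # the double closest to pi, as an exact rational
print(continued_fraction(pi, 5))         # [3, 7, 15, 1, 292]
print(convergents(continued_fraction(pi, 5)))   # 3, 22/7, 333/106, 355/113, 103993/33102
```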
4.2 A predicate for k
We assume in the sequel that $F_X < F_{cut}$ (the case $F_X > F_{cut}$ is similar). We then have $CX \in [1, 2)$. Let $M_X$ be the integer mantissa of X, i.e. $M_X = 2^{w_{F_X}} X$. We search for the integers $M_X \in \mathbb{Z}$ such that
$$\left| \frac{M_X}{2^{w_{F_X}}}\, C - \frac{2A+1}{2^{w_{F_R}+1}} \right| \le 2\varepsilon_a.$$
Depending on the relative values of $w_{F_R}$ and $w_{F_X}$, we face two situations:
4.2.1 Case where $w_{F_R} + 1 \ge w_{F_X}$
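We assume in this case that $k = w_{F_R} + w_{F_X} + 3$. We search for the integers $M_X \in \mathbb{Z}$ such that
$$\left| M_X\, 2^{w_{F_R} - w_{F_X} + 1} C - 2A - 1 \right| \le 2^{2 + w_{F_R}} \varepsilon_a.$$
Since $\varepsilon_a = |C_k - C| \le 2^{-k-1}$, we have
$$\left| M_X\, 2^{w_{F_R} - w_{F_X} + 1} C - 2A - 1 \right| \le 2^{w_{F_R} - k + 1}.$$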
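Note that $2^{w_{F_R}-k+1} < 1/(2M_X)$ iff $2^{w_{F_R}-k+2} M_X < 1$. As $M_X < 2^{w_{F_X}+1}$, and $2^{w_{F_R}+w_{F_X}-k+3} \le 1$ since we assumed $w_{F_R} + w_{F_X} + 3 = k$, we have $2^{w_{F_R}-k+1} < 1/(2M_X)$ for all $X \in [1, 2[$. Dividing by $M_X$, any such X therefore satisfies $|2^{w_{F_R}-w_{F_X}+1}C - (2A+1)/M_X| < 1/(2M_X^2)$, and by a classical theorem of Legendre (see e.g. [11]), the reduced form of $(2A+1)/M_X$ is a convergent of $2^{w_{F_R}-w_{F_X}+1}C$. Hence we compute the continued fraction expansion of $2^{w_{F_R}-w_{F_X}+1}C$, which yields all the candidate values X that may possibly satisfy $\circ(CX) \neq \circ(C_k X)$. For all such inputs X, we first check exhaustively whether those rounded values actually differ, and we collect all such X in a list L. Then, we compute the minimal value η of
$$\left| \frac{M_X}{2^{w_{F_X}}}\, C - \frac{2A+1}{2^{w_{F_R}+1}} \right|$$
when X ranges over the list L of all counter-examples, and we set $k = \max(w_{F_R} + w_{F_X} + 3, \lceil -\log_2 \eta \rceil + 1)$. The inequality $k \ge \lceil -\log_2 \eta \rceil + 1$ implies $k > -\log_2 \eta$, which yields $2\varepsilon_a \le 2^{-k} < \eta$; this guarantees that all inputs X will satisfy $\circ(CX) = \circ(C_k X)$.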
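The candidate search of this case can be sketched as follows, assuming C is available as an exact rational (here a 36-digit stand-in for π/2) and using hypothetical function names. `candidates_case1` keeps the odd multiples g·q of the convergent denominators q whose numerator g·p is odd and which satisfy $|M_X\alpha - (2A+1)| < 1/(2M_X)$ with $\alpha = 2^{w_{F_R}-w_{F_X}+1}C$. Since any fraction (2A+1)/M_X satisfying this bound reduces to a convergent, this finds exactly the integers that a brute-force scan over all $M_X$ finds. The subsequent exhaustive check of whether the rounded values actually differ is left out.

```python
import math
from fractions import Fraction

C = Fraction("1.57079632679489661923132169163975144")   # stand-in for pi/2


def convergents(alpha):
    """Yield the successive convergents (p, q) of a positive rational alpha."""
    p_prev, q_prev, p, q = 0, 1, 1, 0
    x = alpha
    while True:
        a = math.floor(x)
        p_prev, q_prev, p, q = p, q, a * p + p_prev, a * q + q_prev
        yield p, q
        if x == a:
            return
        x = 1 / (x - a)


def candidates_case1(c, wfx, wfr):
    """Mantissas M in [2^wfx, 2^(wfx+1)) with |M*alpha - odd| < 1/(2M), found via
    convergents p/q of alpha = 2^(wfr-wfx+1) * c, as M = g*q, odd = g*p, g odd."""
    alpha = c * Fraction(2) ** (wfr - wfx + 1)
    lo, hi = 2 ** wfx, 2 ** (wfx + 1)
    found = set()
    for p, q in convergents(alpha):
        if q >= hi:
            break
        if p % 2 == 0:
            continue                              # g*p must be odd
        for g in range(1, hi // q + 1, 2):
            m = g * q
            if lo <= m < hi and 2 * m * abs(m * alpha - g * p) < 1:
                found.add(m)
    return found


def brute_force_case1(c, wfx, wfr):
    """The same set, by testing every M against its nearest odd integer."""
    alpha = c * Fraction(2) ** (wfr - wfx + 1)
    found = set()
    for m in range(2 ** wfx, 2 ** (wfx + 1)):
        t = m * alpha
        odd = 2 * math.floor(t / 2) + 1           # nearest odd integer to t
        if 2 * m * abs(t - odd) < 1:
            found.add(m)
    return found


print(sorted(candidates_case1(C, 6, 6)))   # [113]: 113*pi is within 3e-5 of 355
```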
4.2.2 Case where $w_{F_R} + 2 \le w_{F_X}$
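We assume in this case that $k = 2w_{F_X} + 2$. We search for the integers $M_X \in \mathbb{Z}$ such that
$$\left| M_X C - (2A+1)\,2^{w_{F_X} - w_{F_R} - 1} \right| \le 2^{1 + w_{F_X}} \varepsilon_a.$$
Here, again, we use $\varepsilon_a = |C_k - C| \le 2^{-k-1}$ to infer
$$\left| M_X C - (2A+1)\,2^{w_{F_X} - w_{F_R} - 1} \right| \le 2^{w_{F_X} - k}.$$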
mult. by π/2, w_FX = 53          +    LUTs   delay
standard (w_FC = 53)             16   866    20 ns
correct rounding (w_FC = 107)    26   1628   21 ns
Table 2: The price of correct rounding (+: number of additions)
Here again, from the hypothesis $2w_{F_X} + 2 \le k$, we infer $2^{w_{F_X}-k} < 1/(2M_X)$: the computation of the continued fraction expansion of C provides a (complete) list of candidate values X for satisfying $\circ(CX) \neq \circ(C_k X)$. We check exhaustively whether those rounded values actually differ, and again collect all such X in a list L. Let η be the minimal value of
$$\left| \frac{M_X}{2^{w_{F_X}}}\, C - \frac{2A+1}{2^{w_{F_R}+1}} \right|$$
when X ranges over the list L of all counter-examples. We set $k = \max(2w_{F_X} + 2, \lceil -\log_2 \eta \rceil + 1)$. That value of k will ensure that $\circ(CX) = \circ(C_k X)$ for all inputs X.
These computations are currently performed in Maple, but will soon be integrated into the constant multiplier generator.
5 Conclusion and perspectives
One may argue that floating-point multiplication by a constant is too anecdotal to justify so much effort. Yet it illustrates what we believe is the future of floating-point on FPGAs: thanks to their flexibility, they may accommodate non-standard optimised operators, for example a correctly rounded multiplication by an irrational constant. It also makes a good case study for the implementation of such non-standard operators: they cannot be offered as off-the-shelf libraries, since they have to be optimised for each application-specific context. This is the object of the FloPoCo project. Near-term future work will focus on automatically pipelining these operators for a given target frequency. Our results also suggest that there is much room for improvement in the optimisation of integer multiplication: we have defined a pertinent design space, but the exploration of this space is yet to be implemented.
References
[1] N. Boullis and A. Tisserand. Some optimizations of hardware multiplication by constant matrices. IEEE Transactions on Computers, 54(10):1271–1282, Oct. 2005.
[2] N. Brisebarre and J.-M. Muller. Correctly rounded multiplication by arbitrary precision constants. IEEE Transactions on Computers, 57(2):165–174, Feb. 2008.
[3] K. Chapman. Fast integer multipliers fit in FPGAs (EDN 1993 design idea winner). EDN magazine, May 1994.
[4] F. de Dinechin, J. Detrey, I. Trestian, O. Creţ, and R. Tudoran. When FPGAs are better at floating-point than microprocessors. Technical Report ensl-00174627, École Normale Supérieure de Lyon, 2007. http://prunel.ccsd.cnrs.fr/ensl-00174627.
[5] F. de Dinechin and V. Lefèvre. Constant multipliers for FPGAs. In Parallel and Distributed Processing Techniques and Applications, pages 167–173, 2000.
[6] A. Dempster and M. Macleod. Constant integer multiplication using minimum adders. Circuits, Devices and Systems, IEE Proceedings, 141(5):407–413, 1994.
[7] J. Detrey and F. de Dinechin. Floating-point trigonometric functions for FPGAs. In Intl Conference on Field-Programmable Logic and Applications, pages 29–34. IEEE, Aug. 2007.
[8] J. Detrey, F. de Dinechin, and X. Pujol. Return of the hardware floating-point elementary function. In 18th Symposium on Computer Arithmetic, pages 161–168. IEEE, June 2007.
[9] V. Dimitrov, L. Imbert, and A. Zakaluzny. Multiplication by a constant is sublinear. In 18th Symposium on Computer Arithmetic. IEEE, June 2007.
[10] M. D. Ercegovac and T. Lang. Digital Arithmetic. Morgan Kaufmann, 2003.
[11] A. Y. Khinchin. Continued Fractions. Dover, 1997.
[12] V. Lefèvre. Developments in Reliable Computing, chapter An Algorithm That Computes a Lower Bound on the Distance Between a Segment and Z², pages 203–212. Kluwer Academic Publishers, Dordrecht, 1999.
[13] V. Lefèvre. Multiplication by an integer constant. Technical Report RR1999-06, Laboratoire de l'Informatique du Parallélisme, Lyon, France, 1999.
[14] J.-M. Muller, A. Tisserand, B. D. de Dinechin, and C. Monat. Division by constant for the ST100 DSP microprocessor. In 17th Symposium on Computer Arithmetic, pages 124–130, Cape Cod, MA, USA, June 2005. IEEE Computer Society.
[15] Y. Voronenko and M. Püschel. Multiplierless multiple constant multiplication. ACM Transactions on Algorithms, 3(2), 2007.
