A general-purpose method for faithfully rounded floating-point function approximation in FPGAs by Thomas, DB
A general-purpose method for faithfully rounded
floating-point function approximation in FPGAs
David B. Thomas
Dept. of Electrical and Electronic Engineering
Imperial College London
dt10@imperial.com
Abstract—A barrier to wide-spread use of Field Pro-
grammable Gate Arrays (FPGAs) has been the complexity of
programming, but recent advances in High-Level Synthesis (HLS)
have made it possible for non-experts to easily create floating-
point numerical accelerators from C-like code. However, HLS
users are limited to the set of numerical primitives provided by
HLS vendors and designers of floating-point IP cores, and cannot
easily implement new fast or accurate numerical primitives.
This paper presents a method for automatically creating high-
performance pipelined floating-point function approximations,
which can be integrated as IP cores into numerical accelerators,
whether derived from HLS or traditional design methods. Both
input and output are floating-point, but internally the function
approximator uses fixed-point polynomial segments, guaranteeing
a faithfully rounded output. A robust and automated non-
uniform segmentation scheme is used to segment any twice-
differentiable input function and produce platform-independent
VHDL. The approach is demonstrated across ten functions, which
are automatically generated then placed and routed in Xilinx
devices. The method provides a 1.1x-3x improvement in area over
composite numerical approximations, while providing similar
performance and significantly better relative error.
Keywords. Function Approximation, Floating-Point, Faithful Round-
ing, FPGA.
I. INTRODUCTION
FPGAs are now used to accelerate complex numerical
algorithms, often using complex data-flow graphs of floating-
point operations. Existing work on floating-point function
approximations has looked for resource efficient methods of
accurately evaluating common functions, such as the expo-
nential [7] and logarithm [8], but each method is specific to
a particular function. When FPGA designers are faced with
a function which is not available as an optimised primitive,
they are forced to use numerical approximations, which re-
quire significant area, have high latencies, and unknown error
properties.
This paper presents a method for approximating any twice-
differentiable function, including those with local minima and
maxima. The resulting tool is completely automated, and
produces faithfully rounded pipelined floating-point operators.
This allows designers to produce hardware operators for func-
tions which have not yet been hand optimised, and to replace
composite uni-variate functions containing multiple rounding-
errors with a single faithfully rounded operator. The main
contributions of this paper are:
• A generic hardware architecture for pipelined floating-
point function approximations, which uses non-linear
segmentation and a fixed-point polynomial evaluator
to provide faithful evaluation.
• A segmentation algorithm which reliably and auto-
matically partitions the floating-point input domain of
a function into segments which can be approximated
with fixed-point polynomials of low degree.
• An evaluation of the architecture and segmentation
algorithm over ten target functions, comparing post-
place and route area and performance in Virtex-6
against reference implementations.
The C++ source code for the tool is freely available as part of
the FloPoCo [1] framework.
II. PROBLEM STATEMENT
We will work with the extended set of real numbers R,
which contains the positive and negative reals, as well as
signed zeros, signed infinities, and NaN (not a number):
R = {NaN,−∞} ∪ {x : x ∈ R ∧ x < 0} ∪ { −0,+0}
∪ {x : x ∈ R ∧ 0 < x} ∪ {+∞}
The extended reals are given a total order, which defines
NaN < −∞ and −0 < +0.
A floating-point representation F ⊂ R is defined using
three parameters: wf (F ), the number of explicit fractional bits;
we(F ), the number of exponent bits; and e−(F ), the minimum
exponent. The number format F then consists of the special
numbers (NaN,±0,±∞), plus regular numbers composed of
sign, exponent, and significand:
F = {±0,±∞,NaN} ∪ {s× 2e × (1 + f)} where
s ∈ {−1,+1}
e ∈ Z ∧ e−(F ) ≤ e ≤ e+(F )
2wf (F )f ∈ Z ∧ 0 ≤ f < 1}
Here the normalised significand will be considered as (1+f) ∈
[1, 2) with f representing the explicit bits from [0, 1). This
definition follows the FloPoCo numbers format [1] which is
often used in FPGAs, rather than IEEE; the main difference is
the lack of denormal numbers.
We also separate a floating-point representation into bi-
nades, where each binade contains all consecutive numbers
with the same exponent, with one binade for each of the special
numbers, and binades associated with each distinct exponent:
bin(x) =
{
x if x ∈ {NaN,−∞,−0,+0,+∞}
(sign(x), expnt(x)) otherwise
Segmentation
Fixed-Point Polynomial
Table-Lookup
sflags expnt significand
2 1 we(FD) wf(FD)
3+we(FD)+wf(FD)
élog2 nù
c0 cd...sflags
2 1 w(c0) w(cd)
c0 cd x
c1
c1
w(c1)
expnt
we(FR)
y = ∑ci xi
x
 ëSiû ≤  i ≤ éSiù 
3+we(FR)+∑w(ci)
sflags expnt significand
we(FR) we(FR)2 1
wf(FD)
Float-Float
Approximator
Fig. 1. Overview of the function approximation architecture.
There is a total order on the binades, following the order on
the contained floating-point numbers.
The function approximation problem requires a target
function ft : R 7→ R defined by the user, and produces
an approximation fa : FD 7→ FR over some floating-point
domain FD and range FR. Faithful rounding requires:
∀x ∈ FD : fa(x) ∈ {upFR(ft(x)), downFR(ft(x))}
where upF (x) returns x or the next representable number in F
less than x, and downF (x) returns x or the next representable
number in F greater than x. At the boundaries of representable
numbers, one or both of the numbers may be a special value,
such as ∞ or Nan. The function ft must be a total function
over the input domain, and fa will also be a total function. If
the target function is only defined or needed over a subset
of the domain, the user can maps other regions to NaN.
The function must be twice-differentiable, and singularities are
allowed at special numbers (zeros and infinities), as long as
the function is well-behaved for regular inputs.
The first goal of this work is that the approximation
should be automated and reliable, so a non-expert user should
have confidence that an approximation will be produced in
a reasonable amount of time. The second goal is that the
approximation should be faithful and complete, over the entire
input domain, so users should not need to understand or
describe accuracy trade-offs.
III. OVERALL APPROACH
The approach suggested here is to partition the input
domain x ∈ FD into n segments S1..Sn at compile time, then
split the run-time calculation of y = fa(x) into three stages:
1) Use a segmentation function s(.) to map the input
x ∈ FD to an integer i = s(x), where i ∈ 1..n
identifies which of n segments contains the input.
2) Evaluate a fixed-point degree-d polynomial yf =
p(xf , ci), where c1..cn each contain d+1 coefficients
for segment i, mapping the input significand to the
output significand.
3) Attach an exponent and sign or special number flag,
producing y = ai(yf ).
The resulting hardware architecture is shown in Figure 1.
Note that Step 2 only depends on i and xf , while Step 3
only depends on i and yf , meaning that only the segment
and the significand are needed after Step 1. To achieve this
we need to choose the segments such that the only thing that
varies within a given segment is the significand, and that the
mapping yf = p(xf ) can be approximated with a polynomial
of a chosen degree. The process for defining these segments
will first be described, then the approximation architecture
given a set of segments is shown.
IV. SEGMENTATION ALGORITHM
We partition the input domain FD into n segments S1..Sn,
where each segment is a contiguous subset of FD and can be
identified by its lower-bound ⌊Si⌋ and upper-bound ⌈Si⌉:
Si = {x : x ∈ FD ∧ ⌊Si⌋ ≤ x ≤ ⌈Si⌉}
with the segments forming a dis-joint covering of the input
domain. We will use the notation {a...b} to represent a segment
with ⌊{a...b}⌋ = a and ⌈{a...b}⌉ = b.
Out approach uses uses the following steps, each of which
ensures stricter properties on the segmentation.
1) Make all segments flat (contained within a single
binade) in their domain.
2) Make all segments monotonically increasing, de-
creasing, or constant in the segment image.
3) Make all segments flat in their image.
4) Split segments until they can be approximated by a
polynomial of degree at most d.
5) Split segments until they can be faithfully approxi-
mated by a fixed-point polynomial of degree at most
d.
Each stage may result in one or more of the current segments
being split, so the total segment count will never decrease.
Certain target functions will cause more growth in different
stages – for example, periodic and multi-modal functions such
as sin(.) will experience growth in stage 2, many functions will
grow in stage 4, while singularities such as 1/xmay experience
some growth in stage 5. The intent and execution of each stage
will now be described.
A. Flatten segment domains
The starting point is the single segment NaN ..+∞, which
includes every binade in the domain. In this stage the single
segment is replaced with one segment for each binade:
S1 = {{NaN}, {−∞}, {−0}, {+0}, {+∞}}
∪
{{
−2e
(
2− 2−wf (FD)
)
...− 2e
}
: e ∈ e−(FD)..e+(FD)
}
∪
{{
2e...2e
(
2− 2−wf (FD)
)}
: e ∈ e−(FD)..e+(FD)
}
Ensuring that each segment only covers one binade of the
domain means that once an input has been assigned to a
segment, only the significand of the input is needed for
further processing. In principle, this means that there will be
5 + 2we(FD)+1 segments after this stage.
For practical purposes, this stage is also used to reflect
explicit directions from the user about what input interval is
interesting to them. For example, if a user specifies that only
non-negative inputs will be used, the target function is modified
to return NaN for all negative segments. These can all then be
optimised into one large segment that returns NaN.
B. Ensure Monotonicity
Flattening the domain ensures the input doesn’t cross
binade boundaries, and if the output also doesn’t cross bi-
nade boundaries then a significand-to-significand fixed-point
polynomial can be used to approximate the segment function.
However, as arbitrary (twice-differentiable) input functions are
allowed, the segment output image must first be restricted to
monotonically increasing or decreasing ranges. This will allow
the next stage to reliably determine where the segment image
intersects the range binade boundaries.
Segments covering special (NaN etc.) inputs will have
constant outputs, so only segments covering regular inputs
need to be considered. The algorithm which transforms the
flat segments S1 to the monotonic segments S2 is shown in
Algorithm 1. This assumes the existence of a function:
roots : (R 7→ R)× R× R 7→ P(R)
which returns a set of the locations of all real roots within a
given interval.
Algorithm 1 EnsureMonotonicity
1: S2 ← ∅
2: for S ∈ S1 do
3: if special(S) then
4: S2 ← S2 ∪ {S}
5: else
6: Z ← roots(f ′t , ⌊S⌋, ⌈S⌉)
7: for z ∈ Z do
8: zd ← downFD (z) {Round down to zd ∈ FD}
9: if zd ∈ S then
10: S2 ← S2 ∪ {{⌊S⌋...zd}}
11: S ← {x : x ∈ S ∧ x > zd} {May be empty}
12: end if
13: end for
14: if S 6= ∅ then {No roots, or left-overs}
15: S2 ← S2 ∪ {S}
16: end if
17: end if
18: end for
This approach is overly conservative, as it will split
segments which contain a minima or maxima, but are still
completely contained within one output binade. For example,
if we choose ft(x) = 10 sin(x)/9, it will correctly split the
segment {0.5...1} at π/4, as the image crosses between two
binades in the range. However, with ft(x) = 9 sin(x)/10
the output maximum at π/4 occurs entirely within one range
binade, so splitting the segment is not strictly necessary.
C. Flatten Segment Images
The segments are now known to be contained within one
binade, and the segment images are monotonic. This allows
the segments to be safely split by Algorithm 2, which simply
walks across each segment domain, and splits it at any points
where the segment image crosses a binade boundary in the
range.
Algorithm 2 FlattenImages
1: S3 ← ∅
2: for S ∈ S2 do
3: while bin(⌊S⌋) 6= bin(⌈S⌉) do
4: r ← max{x : x ∈ S ∧ bin(⌊S⌋) = bin(ft(x))}
5: S3 ← S3 ∪ {{⌊S⌋...r}}
6: S ← {x : x ∈ S ∧ r < x}
7: end while
8: S3 ← S3 ∪ {S}
9: end for
This algorithm doesn’t depend on whether the segment
image is increasing or decreasing, nor whether there are
discontinuities in the initial segment images. Even if there
are no discontinuities in ft, the discreteness of FD and FR
may mean that the output jumps by more than one output
binade, so there is no requirement that adjacent segments have
contiguous range binades. The maximum on line 4 can be
implemented using bisection search over the segment input,
taking O(wf (FD)) time, so is fast even for large significand
widths.
D. Split to real-coefficient polynomials
The segments in S3 are now flat in both the domain and
range, so all inputs for a segment are in the same domain
binade, and all outputs are in the same range binade. This
allows the problem to be recast from a floating-point ap-
proximation problem to a fixed-point approximation problem.
Special segments in the domain map to constants, and any
special segments in the range are special (constant) for the
entire segment, so polynomial approximations are only needed
for regular input and output binades.
Given a segment S, we can define a function fS which
directly maps an input significand to an output significand.
fS : [0, 1) 7→ [0, 1)
fS(x) = sr × 2
−er × ft(sd × 2
ed × (1 + x))− 1 where
(sd, ed) = bin(⌊S⌋) (sr, er) = bin(ft(⌊S⌋))
The transform function is not necessarily unique to a given
segment, and will be shared by any segments with the same
input and output binades.
Given the transformed function, the segments will be split
as necessary until all segments can be faithfully approximated
via a polynomial of degree d or less. This relies on a polyno-
mial approximation function:
approx : (R 7→ R)× R× R× N 7→ R
∞
× R
approx(f, a, b, d)→ (c, ǫ)
The input consists of: a function f , approximation interval
[a, b], and maximum degree d; and returns a tuple of coeffi-
cients c, plus the maximum error ǫ. In an abuse of notation,
the coefficients are an infinite set from c0, c1, ..., c∞, and the
function ensures that:
ǫ ≥
∣∣∣∣∣
∣∣∣∣∣f(x)−
∞∑
i=0
cix
i
∣∣∣∣∣
∣∣∣∣∣
∞,x∈[a,b]
∧ ∀i > d : ci = 0 (1)
Exactly what approximation is performed is left unspecified
– a minimax approximation will guarantee the smallest ǫ, but
for practical purposes it is sometimes useful to fall back on
a more robust method which converges. The approximation
algorithm (both theoretical and the implementation described
later) only requires that the bound is correct, and is entirely
content if approx(.) only returns linear approximations.
The approx(.) function is used by Algorithm 3 to adap-
tively split the segments until all segments can be faithfully
rounded. Currently a recursive binary splitting approach is
used, which is not guaranteed to provide an optimal split, but
which is simple and reliable.
Algorithm 3 SplitPolynomials
1: T← S3
2: S4 ← ∅
3: for S ∈ T do
4: T← T/S
5: (poly, err)← approx(fS , ⌊S⌋, ⌈S⌉, d)
6: if err ≤ 2−w(FR)−3 then
7: S4 ← S4 ∪ {S}
8: else
9: m← downFD ((⌊S⌋+ ⌈S⌉)/2)
10: T← T ∪ {{⌊S⌋...m}}
11: T← T ∪ {{upFD (m)...⌈S⌉}}
12: end if
13: end for
The temporary set T can be viewed as a list or queue,
and in practise the splitting at line 9 is well handled using
recursion. There can be at most wf (FD) levels of recursion,
so stack overflow is not a danger.
The splitting condition on line 6 is deliberately conserva-
tive, and requires a polynomial that is more than faithful. This
is an empirical tradeoff between the time for (potentially sub-
optimal) approximation over the reals, and the more expensive
approximation over fixed-point numbers in the next stage. It
results in more segments than are strictly necessary, but results
in a better (faster) user experience.
E. Split to concrete polynomials
The final stage in the approximation algorithm is to en-
sure that each segment can be faithfully approximated by
a polynomial with fixed-point coefficients. This requires a
more complex function fixapprox which provides the same
behaviour as approx, but takes an extra parameter l defining
the least-significant-bit of each coefficient:
fixapprox : (R 7→ R)× R× R× N× Z∞ 7→ R
∞
× R
fixapprox(f, a, b, d, l)→ (c, ǫ)
This ensures the same properties on ǫ and c as approx, but
also ensures that coefficient ci is represented as a fixed-point
number with li fractional bits:
∀i ≥ 0 : 2lici ∈ Z
Choosing the LSBs for each coefficient requires some sort
of heuristic, and this work appeals to that of [2]. They propose
that for an input interval [0, 1) split into k equal ranges, and
a target output LSB p, the coefficients should have LSBs li =
2−p+ik. In our case some segments will extend over the entire
[0, 1) range, so k = 1. Another empirical heuristic is used to
encourage speed and reliability of approximation, so an extra
bit is added, resulting in the approximation:
li = 2
−w(FR)+i−1 (2)
The final stage uses the segments S4 from the previous
stage, and produces a final set of domain segments, but now
augmented with the fixed-point polynomial and the error. This
set of augmented segments contains all the information needed
to build the concrete hardware architecture. The process is
shown in Algorithm 4.
Algorithm 4 SplitConcrete
1: l← (2−w(FR)−1, 2−w(FR), 2−w(FR)+1, ...)
2: T← S4
3: S5 ← ∅
4: for S ∈ T do {While segments left to process}
5: T← T/S
6: i← 0
7: while S 6= ∅ ∧ i ≤ d do {Find lowest degree poly}
8: (poly, err)← fixapprox(fS , ⌊S⌋, ⌈S⌉, d, l)
9: if err ≤ 2−w(FR)−2 then
10: S5 ← S5 ∪ {(S, poly, err)}
11: S ← ∅
12: else
13: i← i+ 1
14: end if
15: end while
16: if S 6= ∅ then {Segment needs further splitting}
17: m← downFD ((⌊S⌋+ ⌈S⌉)/2)
18: T← T ∪ {{⌊S⌋..m}}
19: T← T ∪ {{upFD (m)..⌈S⌉}}
20: end if
21: end for
The while loop starting at line 7 is used to ensure that
the lowest degree feasible polynomial is chosen, which has
two purposes. First, it is used to decrease complexity for the
fixapprox function, by not asking it to approximate very sim-
ple segments with high-degree polynomials. Highly linear seg-
ments are very common, for example in sin(x) close to zero,
which can cause stability problems in minimax algorithms.
Second, it avoids the situation where fixapprox can produce
a high-degree approximation to a low-degree segment, but
does so by producing exquisitely balanced polynomials. Such
polynomials may result in very large high-order coefficients,
drastically increasing the cost of the polynomial evaluator and
coefficient tables.
The extra splitting step at line 16 is another step for reliabil-
ity and usage. In principle the polynomials are already accurate
bj[i]
x
x < b1
x   i0
x < bj[i]
bj[i]
x   ij+1
x   ij
x < bj[i]
x   ij..i0  ij
addr
data
bj[i]
x   ij..i0  ij+1
bj[i]
j=2..7, delay=1
j=1, delay=1
j>7, delay=3
Fig. 2. Hardware implementation of the segmenter for increasing numbers
of index bits, using LUTs for small j and block-RAMs for larger j.
enough, but in practise fixapprox may not be able to find a
solution, either directly reporting an error, or taking excessively
long. In such cases it is better to split the segment again,
resulting in an extra segment and sub-optimal polynomials,
but guaranteeing completion.
V. HARDWARE ARCHITECTURE
The output of the segmentation algorithm is a set of tuples
S5 = {S1, S2, ..., Sn}, where n depends on the approximation
range, the input and output types, and the properties of the
target function. Each segment is a tuple (S, p, ǫ), containing:
• S: the segment input domain, which is guaranteed not
to overlap any other segment input domain.
• p: polynomial coefficients of (at most) degree d,
rounded to the LSBs given in Equation 2;
• ǫ: the maximum absolute error between the polyno-
mial output and the target function, expressed in terms
of the output significand (i.e. ǫ ≤ 2−wf (FR) means it
is within one ULP).
These segments can now be used to build the hardware
architecture described earlier in Section III. There are only
three components: segmenter; table-lookup; and polynomial
evaluator. Table-lookup is straightforward, but design decisions
still need to be made in the segmentation and polynomial
evaluation stage.
A. Segmentation
Given an input x ∈ FD, the segmenter needs to find the
unique i ∈ 1..n such that ⌊Si⌋ ≤ i ≤ ⌈Si⌉. Because the
segments form a complete partition of FD, such a segment
is guaranteed to exist, even if the segment just returns NaN.
There is no special structure to the partition apart from the total
ordering over the segments, so it takes k = ⌈log2 n⌉ steps to
locate the correct segment using binary search.
As the target is a pipelined hardware architecture, the
binary search is unrolled into k stages, each of which recovers
one bit of i, shown in Algorithm 5. At the start of each stage
j there are j−1 known bits of i, which are used to select one
Algorithm 5 SearchSegments
i1 ← lessp(x, b1[0])
i2 ← 2i1 + lessp(x, b2[i1])
i3 ← 2i2 + lessp(x, b3[i2])
...
ik ← 2ik−1 + lessp(x, bk[ik−1])
of 2j−1 boundary points from a table bj . The boundary points
are extracted from the segment boundaries as:
bj [i] = ⌊S2k−j(2i+1)⌋, 1 ≤ j ≤ k, 0 ≤ i ≤ 2
j−1
The total number of entries in the boundary tables grows
linearly with the number of segments, while the number of
boundary tables grows logarithmically.
In hardware each stage is implemented using a table look-
up and comparison, with the area and delay cost increasing for
later stages. The comparison function lessp(., .) in Algorithm 5
is derived from the ordering over the input domain FD, and
provides the new bit at each stage:
lessp(x, y) =
{
1, if x ≤ y
0, otherwise
In a hardware this total order can be implemented using a case
statement to order the two numbers according to the segment
type, then an integer comparison of the exponent and fraction
fields to resolve ordering within segment types.
Figure 2 shows the three types of stage used in the current
implementation. For j = 1 the boundary constant can be
combined with the comparator, requiring a single level of
LUTs, and for 1 < j ≤ 7 it is feasible to store the boundaries
in a LUT-based ROM in contemporary architectures, allowing
a delay of one cycle per stage. For j > 7 the tables become
large enough to that they must be stored in block RAM.
B. Polynomial Evaluation
The concrete segments describe n polynomials with fixed-
point coefficients, all of which map from [0, 1) with wf (FD)
fractional bits to [0, 1) with wf (FR) fractional bits. All poly-
nomials share the same fractional precision for the coefficients,
but both the range of the coefficients and the approximation
error will vary for each segment.
The polynomial evaluator must guarantee faithful evalua-
tion across all polynomial segments, while trying to minimimse
pipeline depth and DSP+logic consumption. The approach
used here follows a much simplified form of [2], which
separates evaluation error into approximation, evaluation, and
rounding error:
2−wf (FR) ≥ ǫapprox + ǫeval + ǫround
The maximum approximation error across all polynomials is:
ǫapprox = max (ǫ1..ǫn) ≤ 2
−wf (FR)−2 (3)
The upper bound of 2−wF (FR)−2 on ǫapprox is guaranteed
by line 9 of Algorithm 4, while the final rounding error is
2−wF (FR)−1, which gives a bound on evaluation error of:
ǫeval ≤ 2
−wf (FR) − 2× 2−wf (FR)−2 = 2−wf (FR)−1
Knowing the error budget and the range of the coefficients, in-
termediate rounding can be selected. A rather simple approach
is chosen, which is to choose a number of guard-bits g, and to
round each polynomial stage except the final to −wfFR−i−g
LSBs. So given the polynomial coefficients c0..cd, and an input
x ∈ [0, 1), the approximation is:
ad = cd (4)
ai = round
(
ci + x× aa+1, 2
−wF (FR)−i−g
)
(5)
a0 = round
(
c0 + x× a1, 2
−wF (FR)
)
(6)
where round (x, r) = r ⌊x/r + 1/2⌋ (7)
The guard bits are progressively increased from g = 1 up, and
the first g giving faithful rounding according to the approach
in Section II.D of [2] is chosen. All sources of rounding error
are considered in this evaluation process, so if ǫapprox can be
accurately evaluated (using a tool such as Sollya), then the
fixed-point segment will be faithfully rounded.
For segments which cover the first or last value within a
binade there is a small chance that the faithfully rounded output
will underflow or overflow, so we expect a rounded value a0 ∈
[0, 1), but the fixed-point evaluator returns a0 < 0 or a0 ≥ 1.
Due to the image flattening, we know that a correctly rounded
output must exist within the target binade, so simply saturating
the polynomial to the range [0, 1) is sufficient to ensure correct
rounding. These corner cases can automatically be tested by
ensuring that the endpoints of domain segments are included
in any test input sequences.
VI. EVALUATION
The segmentation algorithm and hardware architecture
have been implemented as a single combined C++ program
within the FloPoCo [1] framework, which takes an approxi-
mation problem as input, and produces pipelined VHDL and
a test-bench. The whole process is automated, and requires no
manual assistance. The Sollya [3] library is used for expression
parsing throughout, and is also used to provide approx(.) using
the remez and uncertifiedInfNorm functions, plus
fixapprox(.) using fpminimax and infNorm. MPFR [4]
is used to provide all high-precision correctly rounded calcu-
lations, both indirectly via Sollya, and directly in many places.
To evaluate the performance of the function approximators,
three types of functions are considered:
1) Primitive functions with existing hand-optimised im-
plementations.
2) Composite functions which can be implemented in
terms of existing hand-optimised primitives.
3) Approximated functions which cannot be expressed
in terms of existing primitives, and must be approx-
imated using polynomials or other techniques.
The reference library of primitive functions is taken to be
those provided by FloPoCo, as a well-respected source of high-
performance floating-point cores for FPGAs.
The list of functions tested is given in Table I, along
with the function definitions, and an approximation interval.
The restricted approximation intervals for certain operators are
intended to reflect typical usage in a hardware application,
TABLE I. SELECTED TEST FUNCTIONS.
Class Name Function Interval
P
log log(x) [0,∞]
exp exp(x) [−∞,∞]
C
normpdf exp(−x2/2)/
√
2pi [−16,+16]
sigmoid 1/(1 + exp(−x)) [−∞,+∞]
log1p log(x+ 1) [−1,+∞]
expm1 exp(x)− 1 [−∞,+∞]
A
sin sin(x) [−pi, pi]
cos cos(x) [−pi, pi]
erf erf(x) [−32,+32]
sintan sin(tan(x)) [−2pi, 2pi]
1.E-09 
1.E-08 
1.E-07 
1.E-06 
1.E-05 
1.E-04 
1.E-03 
1.E-02 
1.E-01 
1.E+00 
2^-28 2^-24 2^-20 2^-16 2^-12 2^-8 2^-4 2^0 2^4 
 
M
ax
 
Re
l. 
Er
ro
r 
Binade 
cos 
erf 
expm1 
log1p 
normpdf 
Fig. 3. Maximum relative error of composite references within positive
binades for composite functions (generic approximations not shown, as they
are all faithful).
based on the author’s experience building such systems. For
comparison purposes, the composite functions were imple-
mented by chaining together existing FloPoCo operators, using
the built-in critical path management to maintain performance.
For the approximate functions, a comparison point was pro-
vided by using existing published numerical approximations.
This follows the process typically used by FPGA designers
when faced with a function with no available hardware prim-
itive – search the literature for an approximation which: a)
contains no loops; and b) only relies on functions for which
they do have hardware primitives. This paper uses approxima-
tions given in Abramowitz and Stegun [5], which is a well-
known source of approximations.1 The formulas and operation
counts are given in Table II, which were implemented using
FloPoCo, again using critical path management, and also
taking advantage of constant multipliers and squarers where
possible. To be clear, we do not claim the A&S approximations
are the most accurate or most efficient, but they are typical of
the solutions used by hardware engineers, who do not know
about minimax tools or range-reduction methods.
One motivation for this work is that faithful rounding is
important, as it ensures relative error bounds. To demonstrate
how bad the relative error of the composite reference imple-
mentations can get, Figure 3 shows the maximum relative
error of five of the functions within each positive input binade.
As expected, the relative error of expm1 and log1p degrades
1The target audience of this paper knows that there are much better ways of
generating approximations, as does the author. However, the target audience
of the tool just want to get things done; in the absence of an off-the-shelf IP
core, an A&S approximation is a solution, while a book or paper is not.
TABLE II. NUMERICAL APPROXIMATIONS USED FOR FUNCTIONS.
Function A&S x× C x2 x× y x + C 1/x exp(x)
sin 4.3.97 1 1 5 5
cos 4.3.99 1 1 4 5
erf 7.1.26 2 5 7 1 1
dramatically as the input approaches zero, with erf following
a similar degradation. In contrast, cos shows poor error for
large inputs, due to naive range reduction for inputs in the
range [π/2, π], while normpdf loses accuracy due to rounding
when squaring the input. The generic methods proposed here
are not shown, as they are simply staight lines showing faithful
rounding.
VHDL implementations were generated for each target,
using the generic approximation method for d = 3 and
d = 4, and the reference method (primitive, composite, or
approximate). The target platform was set as Virtex-6 with
a target of 400MHz for all generated functions. The generic
implementations were then all tested to ensure they were all
faithful. Exhaustive testing has been applied to all operators
for half-precision we(FD) = we(FR) = 5, wf (FD) =
wf (FR) = 10, with no errors observed. Single precision
operators have been tested using: 1000 points evenly spaced
within each input binade; 100000 points evenly spaced over the
input interval; 100 points evenly spaced within each domain
segment, including ⌊S⌋ and ⌈S⌉.
This testing is not intended to check whether the con-
struction process is correct, it is only to check for coding
or synthesis errors in the concrete tools. The process as
given should be faithful by construction, as all floating-point
calculations during construction are performed using either
MPFR [4] with correct rounding, or Sollya [3], which uses
interval arithmetic to control evaluation error. This includes
all calculations related to segmentation, including root finding
and function minimisation, and the bounding of the maxi-
mum approximation across all polynomials. This results in a
conservative upper bound on the maximum polynomial error
ǫapprox in Equation 3, which is then taken into account while
determining the number of guard bits needed in Equation 7
for faithful rounding.
A number of heuristics are introduced in the process
desribed in this paper, such as the deliberate over-segmentation
in Section IV-D and Section IV-E, but these are only aimed at
achieving good compile-time performance and reliability - the
result should always be faithful. As a tool for FPGA users, the
emphasis is taking zero designer effort and a small amount of
tool/compiler time, rather than an extremely optimal process
which may take many hours and need manual tweaking. The
worst-case would be failing with an error message that “remez
failed to converge”, or “function does not oscillate”, as the user
would not be able to take any meaningful action.
All implementations were synthesised, placed, and routed,
using Xilinx ISE 14.3. The target device was a Virtex-6
xc6vcx130t-ff1156-2 part. Re-timing was enabled during syn-
thesis, and gobal optimisation during mapping, but apart from
that the defaults were used, and the same toolchain settings
were used for both generic approximations and reference
components.
Table III summarises the resulting single-precision im-
plementations, both in terms of resources, and performance.
Numbers in brackets measure the generic approximations
relative to the reference, and bolded numbers indicate the best
implementation for each category. For the first two primitive
comparisons (log and exp) the generic approximations are
clearly worse on every metric, but this is expected as the
reference primitives take advantage of range-reductions, and
have been heavily optimised. However, the approximation for
log with d = 3 is within a factor of 3 of the primitive in all
metrics except for DSPs.
For the next four comparisons (expm1, log1p, sigmoid,
normpdf) the generic approximations are a little closer to
the composite references, and for the sigmoid actually beat
the composite in terms of LUT and FF resources due to the
logic-based divider. However, the generic approximation is
producing faithfully rounded results, while all of the composite
functions have known regions of relative inaccuracy.
The last group of comparisons (erf, cos, sin, tansin) are
the main target for this work: functions which have no direct
composite implementation, and currently rely on an engineer to
find some numerical approximation. Now the approximations
are better on every metric except for block-RAMs – the
numerical approximations rely heavily on polynomials, and
so use little RAM. erf in particular shows substantial savings,
requiring very few resources; apart from the latency, it requires
resources similar to the hand-optimised implementation of log.
VII. RELATED WORK
Function approximation has been an active topic of re-
search for many years, both for software and hardware. The
central concepts of range-reduction, approximation, and recon-
struction are well established [6], especially for floating-point
inputs in both software and hardware. This section will focus
on work specifically related to pipelined implementations of
function approximation for use in FPGAs.
Most work has focussed on approximating known common
floating-point functions, such as the exponential [7] and log-
arithm [8]. These use significant function-specific knowledge,
both in the range-reduction and approximation stages, and are
important building blocks both for manual FPGA design, and
for HLS derived designs.
General purpose function approximations for hardware are
less commonly used in FPGAs, and is mostly used in the
fixed-point domain. One example is the approach used in [9],
which uses a similar approach to that used here, using a non-
uniform segmentation of the fixed-point input to select faith-
fully rounded polynomial segments. The segmentation is more
restricted than the approach used here, allowing for a much
more efficient mapping of input to segment, but restricting the
ability to place polynomials exactly where needed.
Another general purpose fixed-point function evaluator is
provided by FloPoCo [2], which uses a uniform segmentation
of the input to make segment selection a simple table-lookup.
This approach is unsuitable as a general-purpose function
approximation, but is ideal for the relatively smooth functions
found within a range-reduced approximation problem.
Altera have produced a large number of floating-point func-
tions for their OpenCL HLS tool, covering all the functions
Function Method LUTs FFs DSPs BRAMs Cycles MHz Lat. (ns)
exp
d = 3 1372 (3.37) 1754 (4.34) 9 (9) 24 (24) 36 (3.6) 166 (0.63) 216 (5.68)
d = 4 1882 (4.62) 2456 (6.08) 15 (15) 15 (15) 42 (4.2) 188 (0.71) 224 (5.89)
Prim 407 - 404 - 1 - 1 - 10 - 264 - 38 -
log
d = 3 1376 (1.72) 1762 (2.95) 10 (2) 12 (12) 34 (2.62) 188 (1.22) 181 (2.15)
d = 4 1618 (2.03) 2068 (3.46) 13 (2.6) 7 (7) 40 (3.08) 200 (1.3) 200 (2.38)
Prim 798 - 597 - 5 - 1 - 13 - 154 - 84 -
expm1
d = 3 1304 (1.79) 1718 (2.23) 9 (9) 12 (12) 35 (1.94) 202 (0.78) 173 (2.47)
d = 4 1876 (2.57) 2419 (3.14) 15 (15) 7 (7) 41 (2.28) 134 (0.52) 306 (4.37)
Comp 730 - 770 - 1 - 1 - 18 - 259 - 70 -
log1p
d = 3 2203 (1.92) 3047 (3.14) 17 (3.4) 10 (10) 36 (1.89) 168 (1.1) 214 (1.73)
d = 4 4020 (3.5) 4974 (5.12) 28 (5.6) 15 (15) 48 (2.53) 207 (1.36) 232 (1.87)
Comp 1148 - 971 - 5 - 1 - 19 - 153 - 124 -
sigmoid
d = 3 1259 (0.81) 1607 (0.89) 9 (9) 5 (5) 34 (1) 206 (0.9) 165 (1.11)
d = 4 1559 (1.01) 1984 (1.09) 12 (12) 6 (6) 40 (1.18) 198 (0.87) 202 (1.36)
Comp 1548 - 1814 - 1 - 1 - 34 - 228 - 149 -
normpdf
d = 3 1498 (1.77) 1966 (3.12) 11 (2.75) 25 (25) 36 (2) 190 (0.9) 190 (2.21)
d = 4 1968 (2.33) 2513 (3.98) 15 (3.75) 16 (16) 42 (2.33) 178 (0.85) 236 (2.74)
Comp 846 - 631 - 4 - 1 - 18 - 210 - 86 -
erf
d = 3 881 (0.19) 1064 (0.26) 6 (0.55) 4 (4) 31 (0.44) 208 (1.1) 149 (0.4)
d = 4 1001 (0.22) 1289 (0.32) 8 (0.73) 5 (5) 37 (0.53) 213 (1.12) 174 (0.47)
Approx 4627 - 4062 - 11 - 1 - 70 - 189 - 370 -
cos
d = 3 1220 (0.47) 1561 (0.75) 9 (0.82) 5 (∞) 32 (0.68) 160 (0.87) 200 (0.78)
d = 4 1498 (0.57) 1889 (0.91) 12 (1.09) 5 (∞) 39 (0.83) 188 (1.03) 207 (0.81)
Approx 2608 - 2087 - 11 - 0 - 47 - 184 - 256 -
sin
d = 3 1366 (0.52) 1690 (0.79) 10 (0.77) 5 (∞) 33 (0.69) 174 (0.9) 189 (0.76)
d = 4 1545 (0.58) 1914 (0.89) 12 (0.92) 6 (∞) 39 (0.81) 203 (1.05) 192 (0.77)
Approx 2648 - 2142 - 13 - 0 - 48 - 193 - 248 -
tansin
d = 3 855 - 1016 - 6 - 3 - 30 - 178 - 169 -
d = 4 991 - 1266 - 8 - 4 - 36 - 186 - 193 -
TABLE III. POST-PLACE AND ROUTE RESULTS FOR SINGLE-PRECISION OPERATORS IN VIRTEX-6 .
required by the OpenCL spec. Their approach is to build
a flexible set of tools and architectures, such as optimised
floating-point polynomial evaluators [10]. This allows them
to quickly construct robust floating-point approximations, but
appears to be, at least partially, a manual process.
The closest approach to this work in terms of goals
appears to be [11], which takes a similar approach to non-
linear segmentation, followed by fixed-point approximation.
They also target automated implementation, and use a similar
approach to the binade approach used here to segment the input
range. However, they use a segmentation scheme that is more
structured, limiting the ability to place segments, and cannot
handle multi-modal functions. There is also an emphasis on
efficiency over accuracy, with relatively low accuracies for
single-precision approximations.
VIII. CONCLUSION
This paper presents an algorithm and architecture for
approximation of floating-point functions in FPGAs. The al-
gorithm is designed to reliably produce a faithfully rounded
approximation for any twice-differentiable function, while
the architecture is designed to allow for high throughput
pipelined implementations. A fully automated open-source
implementation of the approach has been created, suitable for
use by hardware designers with little knowledge of function
approximation.
Experiments in Virtex-6 using ten target functions show
that while the approximations cannot compete with hand-
optimised primitives (as expected), for functions composed of
multiple primitives it is able to provide guaranteed faithful
outputs with a 2x-3x cost in resources. However, the real
strength of the method is in approximating functions which
cannot be decomposed into primitives, where it shows a 1.2x-
3x improvement in resource usage over numerical approxima-
tions, while providing guaranteed accuracy.
REFERENCES
[1] F. de Dinechin and B. Pasca, “Designing custom arithmetic data paths
with flopoco,” Design Test of Computers, IEEE, vol. 28, no. 4, pp.
18–27, July 2011.
[2] F. de Dinechin, M. Joldes, and B. Pasca, “Automatic generation
of polynomial-based hardware architectures for function evaluation,”
in Application-specific Systems Architectures and Processors (ASAP),
2010 21st IEEE International Conference on, July 2010, pp. 216–222.
[3] S. Chevillard, M. Joldes¸, and C. Lauter, “Sollya: An environment for
the development of numerical codes,” in Mathematical Software - ICMS
2010, ser. Lecture Notes in Computer Science, K. Fukuda, J. van der
Hoeven, M. Joswig, and N. Takayama, Eds., vol. 6327. Heidelberg,
Germany: Springer, September 2010, pp. 28–31.
[4] L. Fousse, G. Hanrot, V. Lefe`vre, P. Pe´lissier, and P. Zimmermann,
“MPFR: A multiple-precision binary floating-point library with correct
rounding,” ACM Trans. Math. Softw., vol. 33, no. 2, Jun. 2007.
[Online]. Available: http://doi.acm.org/10.1145/1236463.1236468
[5] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions.
Dover Publications, 1965.
[6] J.-M. Muller, Elementary Functions: Algorithms and Implementation.
Secaucus, NJ, USA: Birkhauser Boston, Inc., 1997.
[7] F. de Dinechin, P. Echeverrı´a, M. Lo´pez-Vallejo, and
B. Pasca, “Floating-point exponentiation units for reconfigurable
computing,” ACM Trans. Reconfigurable Technol. Syst., vol. 6,
no. 1, pp. 4:1–4:15, May 2013. [Online]. Available:
http://doi.acm.org/10.1145/2457443.2457447
[8] N. Alachiotis and A. Stamatakis, “Efficient floating-point logarithm
unit for fpgas,” in Parallel Distributed Processing, Workshops and Phd
Forum (IPDPSW), 2010 IEEE International Symposium on, April 2010,
pp. 1–8.
[9] D.-U. Lee, R. Cheung, W. Luk, and J. Villasenor, “Hierarchical segmen-
tation for hardware function evaluation,” Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, vol. 17, no. 1, pp. 103–116, Jan
2009.
[10] M. Langhammer and B. Pasca, “Efficient floating-point polynomial
evaluation on fpgas,” in Field Programmable Logic and Applications
(FPL), 2013 23rd International Conference on, Sept 2013, pp. 1–6.
[11] “A synthesis method of general floating-point arithmetic units by
aligned partition,” in Int. Cong. CSCC, 2008, pp. 1177–1180.
