Abstract-In this paper we will present a novel method for implementing floating point (FP) elementary functions using the new FP single precision addition and multiplication features of the Altera Arria 10 DSP Block architecture. Our application example will use log(x), one of the most commonly required functions for emerging datacenter and computing FPGA targets. We will explain why the combination of new FPGA technology, and at the same time, a massive increase in computing performance requirement, fuels the need for this work. We show a comprehensive error analysis, both for the overall function, and each subsection of the architecture, demonstrating that the hard FP (HFP) Blocks, in conjunction with the traditional flexibility and connectivity of the FPGA, can provide a robust and high performance solution. These methods create a highly accurate single precision IEEE754 function, which is OpenCL conformant. Our methods map directly to almost exclusively embedded structures, and therefore result in significant reduction in logic resources and routing stress compared to current methods, and demonstrate that newly introduced FPGA routing architectures can be leveraged to use almost no soft resources. We also show that the latency of the log(x) function can be changed independently of the architecture and function, allowing the performance of the function to be adjusted directly to the system clock rate.
I. INTRODUCTION
The motivation for this work is twofold. The HFP DSP FPGA features enable new levels of utility, especially when combined with the flexibility in connectivity; at the same time, new FPGA applications such as those used in datacenters require very different functional capability and density than traditional FPGA use models. The combination of greatly enhanced FPGA architectures, combined with a vital need for higher performance density of new applications, creates a strong incentive for this work.
Many of the new datacenter uses, such as web search and data analytics, require log(x) and exp(x). A recent publication by Microsoft Research [1] stated that the most commonly used functions were log(x), exp(x) and div(x, y). One use case for these functions when accelerating these applications involves using many parallel and latency-sensitive instances. The benefits of this work are manyfold: firstly, the reduced soft logic requirements will allow for increased functional density (absolute number of functions), secondly, the mapping of the arithmetic portions of the function to almost entirely embedded features will reduce soft logic placement issues, thereby increasing the computational density (the ability for a large number of functions to be fit at a certain clock frequency), and thirdly, it allows for a simple latency optimization of the functions (as the embedded DSP features have selectable pipeline depths).
GPU performance is often specified in terms of GFLOPs, or now TFLOPs, of multiply-accumulates (MACs). GPUs often have dedicated function accelerator blocks [2] ; CPUs occasionally have function accelerators, but most elementary functions are typically implemented using proprietary Intel MKL libraries [3] or open source libraries. Work by Markstein [4] detailed targeting modern CPUs, and earlier works by Cody and Waite [5] showed generic algorithm construction. The advantage of an FPGA over a processor type architecture is that if we can find a method to efficiently map a required number of elementary functions for a particular application to uniformly distributed FP features -in this case the HFP DSP Blocks -the sustained performance can approach the peak performance of the device.
Elementary function design for contemporary FPGAs is a complex task because of the multitude of implementation tradeoffs exposed by the architecture features: logic, memory blocks and fixed-point DSP Blocks. For instance, digit-recurrence methods make extensive use of logic resources while polynomial approximation techniques use memory and DSP Blocks; the ratio between memory and DSP Blocks can varied by changing the approximation polynomial degree.
In contrast, microprocessor implementations of elementary functions execute more of the computation using floating-point (FP) arithmetic. FP units support fast execution of addition and multiplication, and use these basic operations in software routines for implementing more complex elementary functions. For most functions, the software routine contains multiple branches of execution, depending on the range of the input. Within these branches various techniques for approximating the function are used; some branches use high degree polynomial evaluation (this may extend to degree 20 or 30), including rational polynomial approximations, Taylor expansions, digitrecurrence methods, and quadratic convergence methods. As an added complication, the arithmetic used internally is often higher than the target precision of the function, for instance double precision elementary functions use double-double and triple-double implementations [6] . This experience is of great concern for FPGA, as multi-precision use would require either multiple cycle arithmetic, which would disrupt the pipeline advantage of a hardware implementation, or need a very expensive multiple operator construction. One of our goals is to find an effective method that works completely in the native precision of the input and output of the function.
Porting these implementations directly to the FPGA architecture by means of high-level synthesis tools is inefficient. The disjoint nature of execution branches potentially allows for resource sharing, however the different algorithms used within the branches makes sharing difficult. Within branches, the high degree polynomial evaluation requires a significant number of additions and multiplications for high-throughput implementations. Efficient FPGA implementations need to make use the entire mix of FPGA features. For instance, the high-degree polynomial evaluations may be replaced by piecewise polynomial approximations where numerous low-degree polynomials are used to provide an equivalent approximation quality. The coefficients for these low-degree polynomials would be stored in the memory blocks, available in thousands on modern FPGA devices. Other bit level manipulation techniques, costly in microprocessors but virtually free in FPGAs, can be used during the range reduction and reconstruction stages.
The paper is organized as follows. Section II provides a background on FP representation, as well as the target Arria 10 FPGA. Section III reviews other published log(x) work to give a context to the improvements made in this work. Section IV describes algorithms for mapping log(x) into hardware, describing an improved range reduction method. Sections V and VI present the implementation approach together with a comprehensive error analysis. While Section V discusses the accuracy of the log(x) function for the selected implementation, the second describes the individual algorithmic components, and analyses the error contribution of each to the function. This allows us to assemble a complete operator. A discussion of the results is provided in section VII. The large advances in technology, performance, and QoR, suggest further work, which is described in section VIII, before we finish with the conclusions and references.
II. BACKGROUND

A. Floating-Point
Let x be the FP input such that x = (−1) s 2 e M where s denotes the sign of the number -with values ∈ {0, 1}, e denotes the exponent and M denotes the mantissa. The IEEE-754 standard for FP arithmetic [7] uses a normalized mantissa with m ∈ [1, 2). Since the leading bit of the binary mantissa representation will be a constant one, it is omitted in the representation. Consequently we use the following notation:
s 2 e 1. f where f denotes the fraction of the FP number. The number of bits used to represent the exponent and fraction give the different FP formats presented by the standard. When dealing with FP arithmetic a useful tool when managing errors is the notion of ulp, which is defined by Harrison [8] as the distance between two adjacent FP numbers.
B. Arria 10 FPGA
The platform for our work is an Arria 10 FPGA with HFP DSP Blocks supporting IEEE754 single precision [9] . The • M20K memory blocks which can be configured in 512x40-bits or 1024x20-bits; • DSP Blocks which can be configured in fixed-point mode to execute one 27x27-bit multiplication, 2 independent 18x19 multiplications ( Figure 1 ) or one sum-of-two 18x19-bit multiplications; • the same DSP Blocks which can be configured in FP mode to execute one single-precision addition, multiplication, accumulation, one multiply-add operation or one multiply-accumulate operation ( Figure 2 ). The configurable latency is not detailed in this block diagram. Up to four pipeline stages are available in FP mode: an input stage, a FP multiplier pipeline stage, a FP adder input stage, and an output stage. Each pipeline stage can be optionally bypassed. For maximum multiply-add performance, all four stages are employed. In the case of a standalone FP multiplier, up to three stages can be used. Likewise, three or four stages can be used for a standalone FP adder. Although only two registers directly connect to it, using the DSP Block input stage will remove the additional combinatorial path from the input of the DSP Block to where the FP adder input register is physically located inside the block. Although the DSP Block is designed to approximately match the maximum performance of the FPGA, the system level performance of a typical large design fit into the FPGA may be considerably lower than this. For example, although a 20nm planar FPGA may have a maximum performance in the 500MHz range, it is not uncommon for complex implementations to achieve only half of that. The configurable latency of the DSP Block will allow us to adjust the latency and performance of the DSP Block downwards to match the current design performance. As our goal is to map, as much as possible, this function to only embedded features, we are able to gain the benefit of latency reduction through a planned performance drop, while not affecting the way that multiple functions will fit into the device.
C. OpenCL Conformance
Recently, OpenCL has become a supported design flow for the FPGA industry [10] , [11] . The promise of efficient implementation with a heavily abstracted design entry may make FPGAs accessible to programmers, who are not typically well versed in hardware design. OpenCL conformance requires a large, clearly defined set of vectors (approximately 4 × 10 9 individual tests) for each function. Each function has a maximum error allowed, which in the case of single precision log(x) is 3 ul ps, although in some cases an additional extra 1 ul p of error is permitted for embedded contexts (which includes FPGAs). Correct rounding is not required for elementary functions, including log(x).
III. PREVIOUS WORK
The problem of FP log(x) has been well studied for FPGA targets, although most of the published methods are not suitable for the new generation of computing applications, because of area, performance, latency, and accuracy.
The performance limitations of software (SW) elementary function have been recognized; in comparison to hardware, table sizes for the exact portion of the functional decomposition are usually small (for cache management reasons), necessitating many term power series for the approximations portions of the calculation. The motivation for a fast single precision FP log(x) was described in [12] , which stated that an improvement was needed over the software library example of a 94 cycle optimized log(x) [13] (GNU glibc requires 196 cycles for the same function [14] ). A speed up of 6-8 times was realized, but with the tradeoff of large tables, and an decrease of accuracy by one to two decimal digits in the returned result. This algorithm was mapped to a Virtex II by Stamatakis [15] . Although the area is competitive (621 LUTs, 932 registers, and 3 DSP48E Blocks), the accuracy is much lower than our target OpenCL compliance.
In [16] , Bruce, et.al. describe a library of FP functions, including log(x). Range reduction for log(x) was performed using division, which is very expensive, with a long latency, compared to the methods described in this work. The latency of the function was not reported.
Dinechin and Detrey showed a FP logarithm in [17] . Their approach did not use a range reduction for the mantissa processing, but instead approximated the contribution of the mantissa directly using a Higher-Order Table-Based Method HOTBM [18] polynomial evaluation. A combinatorial version of their design required 830 slices (each Virtex II slice contains two 4LUTs and two registers) and 9 18x18 multipliers, with an estimated 10% area increase to support pipelined operation, which they estimated to be able to achieve 100MHz.
A recent design by Xilinx [19] , uses a CORDIC datapath [20] , with multiplier based post processing. Single precision accuracy is claimed but not proven; a figure is shown where the maximum error is 1 ulp over a 2K sample set. The core is large compared to the other works: a Virtex7 -3 speed grade implementation requires 4.9K LUTs and 8 DSP48Es, with a very long latency (64), and modest performance (332MHz for a single instance). In our experience, the CORDIC method, with deep pipelines of subsequent carry chains, does not allow for a high functional density, as the relative placement of the LUTs will be constrained by the carry chains. While a single instance may exhibit good performance, multiple instances will show the effects of routing stress and congestion.
IV. NATURAL LOGARITHM TECHNIQUES
Computing the natural logarithm for x starts with applying the logarithm properties on the FP representation of x:
The natural logarithm is only defined on positive inputs, so the above formula can be simplified:
We will focus in the following on the first branch of this function. The sum e log(2) + log(1. f ) may be the result of a massive cancellation when e = −1 and 1. f → 2. In order to return the sufficient number of meaningful result bits for a FP result, the fixed-point precision of the calculation, together with the accuracy of both terms needs to significantly increase. Alternatively, if the massive cancellation is prevented then there will be no loss of accuracy and hence no need for an extended internal precision. This is accomplished by restating the formula with a single branch.
The cancellation condition now triggers the second branch. When 1. f becomes greater than √ 2 and e = −1 the first term becomes zero and all the accuracy is then returned by the second term.
The following notation is used:
where 
In this section we introduce the implementation approach for the natural logarithm on modern FPGAs. We will develop the individual elements of these techniques into our FP target design in the next section.
A. Computing log(m)
The second term log(m) with m ∈ [ Figure 3 . Calculating over this small, gently increasing monotonic section will eliminate the chance of cancellation errors. The value of log(m) may be computed using a Taylor series for log(1 + y) where m is sufficiently close to 1 (y sufficiently small). The Taylor expansion used has the form: log(1 + y) = y − y For a given input and output precision p (for single precision p = 1 + 24) and a small interval for y (|y| < 2 −y max ), the higher order terms with contributions smaller than the ulp of the maximum magnitude term can be dropped. In other words in the expansion below, Due to this wide range of y, the truncated Taylor series with sufficient accuracy would require tens of terms (same order as the required precision) which leads to a costly implementation for a high-throughput FPGA architecture. A general technique used is to reduce the range of the input so that the computation is less costly. The final result is obtained after a reconstruction stage and is based on the computation on the reduced argument. Conversely, a microprocessor-based implementation would prefer the high number of FP operations to the branching required by the range reduction.
B. Range Reduction
Range reduction requires a division, but as we have seen [16] this is not efficient for hardware, in terms of both latency and area:
log(a/b) = log(a) − log(b)
where a is desired to have the form 1 + y.
Ideally, we will try replace the division with a multiplicative inverse. We want to represent m = (1 +
VI. IMPLEMENTATION ACCURACY This section, which describes and analyses our contribution, is split into two parts. The first subsection will provide an error analysis of our range reduction approach, and the following subsections will then combine a description of the implementation of each arithmetic component of the calculations with an associated error analysis.
When computing y we can either use FP or fixed-point arithmetic. Properly scaled and bounded, the mantissa portion calculation in fixed-point arithmetic can achieve single precision accuracy, but may require significant FPGA resources.
Our logarithm implementation computes final result as the sum of three terms:
which we denote for simplicity by terms A, B and C. Let us assume that the three terms are computed in FP as accurately as possible, that is to say, each having a maximum error of half ulp. We use the following notation where the A variables represent values post rounding, non tilde variables are mathematical values and δ represents the value of a half ulp.
and: If B −C << max (B,C) the error present in computing B or C (whichever is larger) will be amplified by this amount. Figure 4 plots the size of the cancellation between the B and C terms. As expected, the cancellation occurs when m is close to 1, either approaching from below or from above within a 2 −9 region. This case is handled separately (in this case there is no need for range reduction) and allows for the B-C computation to provably have no major cancellation when a range reduction stage is required. Now provided that the 3 terms A, B and C are computed accurately to 1/2 ulp, the formula will allow us to compute a natural logarithm within 1.5 ulp.
A. Computing term A = E log(2)
A FP number of the product E log(2) can be obtained directly by tabulation. The 8 bit exponent is used to address a table (in the case of an Arria 10 device, a single M20K configured in 256x40-bit mode) which stores all the possible values of the product.
By definition, this term will have a 1/2 ulp error bound.
B. Computing term C = log(r m top )
When tabulating the natural logarithm of the inverse r m top we make use of the property that m < 1 only when 1. f > √ 2 (Equation 2), which is also our branch condition. Therefore, the table address will hold the branch bit and the top 9 bits of f (the implied leading one is not used, but accounted for). This allows us to double the accuracy of the inversion when m < 1 as there is one extra bit of information used in the input.
Since this result is obtained through tabulation, the accuracy is again guaranteed to be within 1/2ulp.
C. Computing term B = log(1 + y)
The accuracy of log(1 + y) will directly be influenced by the accuracy of y. If no range reduction is necessary (m close to 1), the cancellation during subtraction (y = m − 1) will ensure that no error will be produced. If range-reduction is necessary, then we must ensure that the y argument holds sufficient accuracy such that if the other terms ( A and C) are either zero or have the same magnitude as B, the error will be sufficiently small.
When computing y = m · r m top − 1 in FP our operands will be FP numbers. The product m · r m top is bound by construction to have the form 1.000000001X...X.
The value for y is obtained by subtracting 1 and normalizing:
1.XX...X000000002 −9 .
The error from m · r m top will now be amplified by the cancellation, having a magnitude of 2 −23+9 .
The first term E log (2) is not zero unless the exponent of x is −1. When this term is non-zero, the smallest value will occur when E = 1, in which case the first term is ≈ 0.693. The second term (computed as the difference log(1 + y) − log(r m top )), when the first branch is active (m < √ 2) -we will later show an alternate approach, with a simpler branching condition of m < 1.5 -will have a maximum magnitude of approximately 0.405. As y < 2 −9 , the magnitude of the first term is 1.951e-3, or 1.11111111100000...2 −10 in binary. The final term to compute is log(r m ), which can simply be tabulated.
.10110001011100100001100 log (2) .01100111111101110011011 log(1.5+2ˆ (-10)) which is .10110001011100100001100 log(2) .000000000010101010100111000 log(1+y) -0.0110011111001100100011111011 log(1/ry) -
----------------------------------------
We will now show how to compute y directly in FP. The term m is exact and carries no error into the calculation. The term r m is not exact, and carries into the calculation a maximum error of half ulp. The product m · r m top will approach 1 by construction, and therefore the final subtraction will not introduce any additional error, but will amplify the existing error by the cancellation size.
Consider we are working in precision p with an unbounded exponent range, then δ r ≤ 2 −p−1 . In the above formula we inject the maximum error value and therefore have:
Consider a cancellation size ψ, then the error in the y term is:
This shows that given a ψ-bit cancellation -which we will get by multiplying m by a roughly ψ-bit accurate approximation of its inverse -the error resulting when computing y provided that no rounding error was actually performed on m·r m top , is roughly 2 ψ ulp. The value y is then used to compute a truncated Taylor expansion for log (1 + y), which is term B in our equation. Because of the linearity of the function on the interval close to 1, the 2 ψ ulp error in the input will roughly translate to a 2 ψ ulp error in the output.
This error will be the final error when both terms A and C are zero in our equation. This will be the case then E = 0 for term A and r m top = 1.000000000. Fortunately this case is captured separately by the no range-reduction branch (triggered when m is within an interval of size 2 −9 away from 1).
The 2 ψ ulp error is present nonetheless for inputs slightly larger than 1 + 2 −9 and is visible in Figure 4 when terms B and C have roughly the same magnitude and A = 0.
Therefore, the accuracy of 1 + p-bits stored in r m top is not sufficient having this ψ bit cancellation. Storing more accuracy in r m top is impossible using the single precision format (which was chosen to match the HFP DSP Blocks). The requested error for r m top needs to be smaller than 2 −p−1−ψ provided that we can produce the m · r m top product without rounding errors.
D. Improved accuracy range reduction
We store r m top in fixed-point on a precision 1 + p + ψ bits. The multiplication m · r m top is a fixed point multiplication with the result very close to 1; m is multiplied by a 1 + p + ψ bit accurate inverse computed using only the leading ψ bits of m. The result will have a deterministic position for the leading one, but after the subtraction m · r m top − 1 there is no guarantee where the leading one is found. Hence, a classical leading zero counter and left shifter (normalization stage) is needed for converting this value into FP.
We can now present a novel way of performing this calculation efficiently using the available HFP DSP Blocks; the purpose of this stage is to obtain an accurate FP value y which can then feed into the Taylor polynomial evaluation stage.
We accomplish the translation between the fixed-point product m·r m top , via the difference m·r m top −1 into FP by exploiting the format of this product: leading '1' followed by a number of zeros followed by information bits.
The idea is to represent the fixed-point product m · r m top as a sum of FP numbers having a small overlap: m · r m top = j + i − k. Once accomplished, the difference m · r m top − 1 can be performed using FP arithmetic only.
A very efficient way of obtaining FP values from the fixedpoint product m · r m top is to:
• take j as the most significant 24-bits of the product; m*rmtop = 1.000000000XXXXYYYYYYYYYYYYY j = 1.000000000XXXX + i = 1YYYYYYYYYYYYY-k = 1 Figure 5 . Fixed-point product to FP decomposition showing i, j and k mantissa alignments
• inject a '1' having a weight equal to the LSB position of j and take i as the 24 bits starting with this injected '1' and the next 23 bits of the product; • since an artificial '1' was inserted it needs to be subtracted using k. The leading '1' injected into i is used in order to avoid the leading-zero counting for obtaining i and obtain directly a normalized FP value. Figure 5 shows the fixed-point alignments of the mantissas of the 3 FP values created against the fixed-point product m · r m top .
However, the full computation of y further requires subtracting the FP value '1'. The required order of operations is:
One of the 3 FP operations in Equation 4 can be saved by restructuring the calculation as follows:
and observing that the weight difference between '1' and k, which are both constants, allows packing them into one single FP mantissa, without actually preforming a FP addition. The extra accuracy required for y before feeding into the polynomial evaluation stage is obtained after the addition of the i term. The typical values for ψ are 8-11 bits, meaning that 8-11 bits of m are used to address a table which outputs a 1 + p + ψ precision value. These values are selected such to match the characteristics of the embedded memory blocks M20k.
As previously stated, the value of the product m · r m top will come close to 1, most of the time having a value larger than 1. Since m ≥ m top where m top represent the leading ψ bits of m, then m × 1/m top will be larger than 1 except when m = m top . In this exceptional case the value of log(1 + y) − log(r m ) will be returned by the tabulated log(r m top ) term.
The architecture of the single precision natural logarithm is depicted in Figure 6 . For simplicity, the FP input x is split into its 3 components, the sign, the exponent and the fraction. Figure 6 . Architecture for the single-precision natural logarithm the fixed-point inverse (r m top ), is computed using a table with an extended precision. A fixed-point multiplier then allows us to obtain (mr m top ). Next, the FP subtracter and adder perform Equation 4 in order to obtain y. Data is fed to these units through a network of multiplexers which allow combining the case when |m| ≤ 1 + 2 −9 within the same datapath. When close is high, the subtracter inputs x and the constant 1, while the adder will simply add zero to this result; additionally, the third term in Equation 3 will also be set to zero, since no range reduction is necessary in this case.
The truncated Taylor series is computed using a Horner scheme in order to minimize resources. Sequences of a multiply/add pairs are mapped to a individual FP DSP Blocks. The final value for log(m) is obtained by subtracting log(r m top ) in FP from the Horner evaluation output. The final result is obtained by summing E log(2) to this value.
By inspection, the Taylor series maps to three chained HFP DSP Blocks, with the FP adder in the last block also adding in log(r y ). Only the adder portions of three additional HFP DSP Blocks are used elsewhere in the circuit, both to prepare the input argument for the Taylor series, and implement the final summation. Two more DSP Blocks, configured as fixed point multipliers, implement a 36x34 bit fixed point multiplier to provide the multiplicative inverse function required by the range reduction operation. 
VII. RESULTS
Resource and performance results are shown in Table I . In every case, the architecture and therefore the function is identical, with only the internal pipelining of the HFP DSP Blocks varied. The default highest performance mode latency of the DSP Block in FP mode is four for both the standalone adder and the multiply-add pair.
One goal of this work was to create a core that allowed for higher system density; another way of stating this is that the core would never be the performance bottleneck in the system. Varying the embedded feature latencies while keeping the rest of the behaviour of the core unchanged would allow the core to be performance adjusted to a potentially lower speed system; a lower latency is often preferable, and this flexibility may be an advantage.
In the table, the design examples are itemized by feature latencies. Other combinations are also possible, and the performance results may change as final timing models become available for the device. Although the architecture is constant across all latency options, the number of ALMs drops noticeably; this shows that the logic is typically used for balancing the pipeline, rather than computation. Of particular interest for future work is the LAB usage (the grouping of 10ALMs into a LAB will not have a wholly independent number of inputs and outputs to the general purpose fabric), which may give a better fitting density scale than counting individual ALMs, and may be use to benchmark the efficiency of different options in future work.
We have not compared this work with previous FPGA logarithm cores presented in Section 3, as we felt a comparison would be unfair and unhelpful. Many of the published works are on older FPGA technologies, with often different routing and logic structures, for example, the depth and complexity of LUTs. Accuracy is also either not usually reported or proven; in this work we have proved our accuracy by analysis, as well as the extensive and clearly proscribed OpenCL conformance vectors.
The techniques presented in this paper may eventually result in an order of magnitude soft logic and routing reduction compared to the other approaches; a closer investigation into the resource breakdown of our architecture shows that the vast majority of non-DSP and block-memory resources goes into synchronization paths: either registers or MLAB-based delay chains. The newer Stratix 10 device family on 14nm FinFET process [21] , introduces a new routing architecture, containing a large number of embedded registers. The balancing paths could potentially be transparently incorporated into the embedded routing registers, potentially leading to an almost trivial logic use.
VIII. FUTURE WORK
Optimized exp(x) is used in conjunction with log(x), and div(x, y) is also needed for the same applications [1] . We plan to apply some of these techniques to explore improved hardware implementations for these functions.
This required increase in performance also introduces the need for computational density benchmarking. One of the motivations for this work was to be able to fit more functions, with higher F max , contemporaneous with other sizeable designs implemented in both soft logic, as well as the multiplyaccumulate and vector reductions supported by the HFP DSP features. Full device benchmarks, to give confidence in expected peak to sustained throughput ratios not only for the traditional measurement of MAC GFLOPs -but also for a new benchmark considering the saved overhead of elementary function implementation -should be considered. The availability of new FPGA architectures, such as the hyper-register routing in Stratix 10, may also change the methodology and results for this.
Another area of interest is the scaling of this FP functional mapping to non-native precisions, such as single extended or even double precision. Multi-operator implementation of higher precision arithmetic are used in software, but would be expensive to unroll into a hardware environment. Analogous to the combination of bit level and word level techniques used in this work, the ideal would be to find a way of efficiently using single precision HFP, along with the flexibility of soft logic, to support other precisions.
IX. CONCLUSIONS
In this article we have presented an architecture for the single-precision natural logarithm targeted at FPGA architectures with hardware support for FP addition and multiplication, such as the Arria 10 FPGA. The presented implementation makes efficient use of the entire mix of resources: the DSP Block in fixed-point arithmetic mode, memory blocks, logic, and most significantly, the DSP Block FP arithmetic mode. The accuracy of the proposed implementation is studied using the worst case cancellation values, and is shown to be accurate to 3ulp, the bound required for OpenCL conformance [22] .
We have shown the scalability of this method to future FPGA devices; with expected routing architecture enhancements such as embedded pipelining, the soft logic requirement of these functions may approach zero.
Although timing values for the Arria 10 device are preliminary and the current values are lower than expected, the critical paths of this architecture show that high frequencies (+400MHz) can be expected for this core.
