Abstract-The paper presents an efficient algorithm to compute base-10 logarithm of a decimal number. The algorithm uses a 64-bit floating-point arithmetic, and is based on a digit-by-digit iterative computation that does not require look-up tables, curve fitting, decimal-binary conversion, or division operations. It is the first FPGA prototype of its kind that uses a 64-bit (decimal 16-digit) precision. Two numerical examples have been presented for the purpose of illustration. The algorithm produces very accurate result with a maximum absolute error of 3.53x10 
I. INTRODUCTION
Elementary functions such as the logarithm and the exponential operations have become very useful in many applications such as, financial analysis, tax calculation, internet based applications, and ecommerce [1] , where these operations are used to avoid hardware-expensive multiplication and division operations. In the past, several hardware-efficient methods have been proposed for computing the base-2 logarithm of binary numbers [4] [5] [11]- [16] . However, after the inclusion of decimal floating-point (FP) operation in the latest IEEE754-2008 standard [6] , more researchers have devoted their effort in developing decimal FP algorithms and architectures to efficiently compute logarithms [7] [17], exponentiation [28] , trigonometric operations, etc.
A study has shown that 55% of the numbers stored in the database of 50 big organizations is decimal [21] . There are several software packages available to customer to compute decimal numbers using decimal arithmetic to minimize error [23] , but the softwareimplemented decimal arithmetic requires much longer time to execute than the hardware version [1] , which led momentum to its implementation in hardware. IBM has recently implemented decimal FP architecture in their POWER6 [3] [19] , z9 [29] , and z10 microprocessors [20] . Several decimal architecture of multi-operand carry-save adder [24] [25] , carry look-ahead adder [26] , parallel BCD adder [27] , signed-digit adder [18] , etc. have been proposed.
There are several applications which require the direct computation of decimal (or radix-10) logarithm, such as, to measure the pH in chemistry, the earthquake intensity in Richter scale, the optical density in spectrometry and optics, the brightness of stars in astronomy, etc. [2] . Moreover, the radix-10 logarithm is widely used in computing the ratio of voltage and power levels (called bel) in telecommunications, electronics and acoustics. In most base-10 logarithmic converters, the decimal input is first converted to binary followed by base-2 logarithm computation; after the completion, the result is converted back to decimal radix -these back and forth conversions of bases introduce errors on the system. A generalized iterative algorithm to compute base-k logarithm has been presented in [8] ; however, the division operation in that work limits the performance by incurring erroneous computation. Moreover, the use of lookup tables and the lack of user control on the number of iteration make this algorithm very inefficient for hardware implementation.
We have recently presented a 32-bit decimal logarithm (in short, log) converter [22] . While the decimal32 format, as defined in the IEEE754-2008 standard [6] , is only used for storage, the decimal64 and decimal128 are used for more accurate decimal computation. Being motivated by the fact, in this paper, we extend the algorithm to compute the radix-10 log using decimal64 precision, which is the first FPGA prototype of its kind. The algorithm is based on a digit-by-digit iterative computation that does not require error correction circuitry, look-up tables, curve fitting, or division operations. The number of iterations of the log converter depends on the user defined precision. The previous 32-bit design [22] suffers from high latency (e.g. 40 clock cycles) due to an inefficient power-10 algorithm and unpipelined operation. Here, we present a very efficient power-10 module with a pipelined architecture that may take only 4 clock cycles to produce the log result with a high process throughput of 51 mega-samples/sec. The error analysis shows that the proposed scheme produces very accurate result. The architecture is developed based on 64-bit binary coded decimal (BCD) representation and compliant with the IEEE decimal64 FP standard.
The paper is organized as follows: Section II presents the background. In section III, the digit-by-digit algorithm is presented. The pseudo-code of the algorithm and two examples are also presented for illustration. A detailed hardware implementation is discussed in section IV, where the description of different internal modules is presented. Section V discusses the performance analysis with a comparison of hardware among related log converter designs. The paper is concluded in section VI.
II. BACKGROUND
The general form of any positive number, L can be expressed as:
Where, R is the numerical base, and 
The coefficients in (2) can now be divided into two categories: integer (or "character":
and fraction (or "mantissa": 1 2 3 , , ,...
C C C
). The procedure to compute these coefficients is described in the following section.
III. ALGORITHM FOR DECIMAL64 LOG

A. Decimal64 format in IEEE754-2008
The IEEE 754-2008 decimal FP arithmetic supports the decimal32, decimal64, and decimal128 computation and data interchange formats, and implements all the operations and conversions [6] . The basic decimal FP format is illustrated in Fig. 1 .
The Sign is a 1-bit field and indicates the sign of the number where S is 0 or 1. The combination field is a w+5-bits field that encodes two most significant bits (MSBs) of the exponent and the most significant digit (MSD) of the coefficient. The Not-a-Number (NaN) and Infinite number (Inf) are indicated in the Combination Field. The biased exponent is a w+2 bit quantity, where the value of the first two bits of the biased exponent taken together is 0, 1, or 2. The whole encoded exponent is an unsigned binary integer with the largest unsigned value. The value of the exponent is calculated by subtracting an exponent bias from the value of the encoded exponent, to be able to represent both negative and positive exponents.
The Tailing Significand Field (3j x 10 bits) is formed by appending the decoded continuation digits (j-bit) as a suffix to the most significant digit (MSD) derived from the combination field. Each 10-bit group represents three decimal digits, using Densely Packed Decimal (DPD) encoding [30] . The format encodes a total of p=3j+1 decimal digits, where p = the number of digits in the significand (precision). For decimal64 format: w = 8; j = 5, exponent bias = 398, and p = 16.
B. Proposed iterative algorithm
The IEEE754-2008 standard [6] defines any nonnormalized unsigned decimal fraction as:
. To be compatible with the standard, we extend our input decimal number, P as follows:
( 1) 10
Where, S is the sign, a is the exponent, and c P is the coefficient (in integer form). Taking a 64-bit log of (3) results in (4): In order to perform the initial range reduction, we extend (4) further as given below: 10 10 log
Where, (range:
is the characteristics of 10 log ( ) P and is obtained by detecting leading zeros; k is a decimal fraction (range: 0.1 1 k ). After separating a, and combining (2) and (5), we get the following (6): This division (by 10) operation can be easily implemented by right shifting the input digits. In order to determine the fractional parts (e.g., 1 2 3 , , ,...
we take the power-10 of (7) For cases where the input lies between 0 and 1, the log produces negative result. Interestingly, the proposed algorithm is capable of handling such cases. For this purpose, we first adjust he decimal point as follows: 
Taking the 64-bit radix-10 log in both sides of (9) leads to the following (10): 10 10 log | | log L P n m (10) Now, the computation of 10 log m follows the procedure described in (5) - (8) .
C. Pseudocode of the algorithm
The pseudo-code of the proposed algorithm summarized below and illustrated in Fig. 2: 1. Read decimal input, P and the number of mantissa in the log result, i 2. Transfer P to A and perform initial range reduction 
D. Examples
In order to better illustrate the algorithm, we present two examples in the following section:
1) Determine the logarithm of decimal number, P = 123456789.123456 up to three fractional digits.
The computation steps are as follows:
Here the user sets the number of fractional bits; so,
; hence, the computation process will continue up to the computation of 3 C , and the final answer will be in the format: 0 1 2 3 .
The number of integer in P: ; hence, the computation process will The iteration stops, and the final answer is: 6 0.091 5.909 L Thus, it can be seen that the algorithm does not require any lookup tables, curve fitting, FP division operations, or error correction circuitry. The following section describes the hardware implementation of the proposed scheme.
IV. HARDWARE IMPLEMENTATION
The architecture of the radix-10 log converter is shown in Fig. 3 . It consists of two major units both connected to a controller: Synchronous register-Counter and Core unit. The converter accepts two inputs: a 16-digit decimal number (P, in BCD) including the decimal point (inputted as hex '10') and the desired number of digits after the decimal radix point (i, in binary) in the final log result. Depending on i, a down counter is set which defines the number of iterations. The core unit performs the fundamental computation supervised by the Controller. The width of the data lines in all the following figures is in decimal digit, unless otherwise specified.
The architecture of the core unit is shown in Fig. 4 . It does not show the interaction with the controller. The core architecture is developed using unsigned BCD representation with an internal precision of 16 digits (64-bit binary). The DPS (decimal-point separator) module detects and separates the DP, and then stores the unsigned magnitude to a temporary register. The DP follows a separate path (DP Accumulator -DP Update) that is parallel to the core computation. The 'DP Update' module tracks the position of the decimal-point and updates it after every computation step.
The 16-digit unsigned input is passed to the 'Zero Detector' (ZD) module that determines the number of integers, which is the first digit (or coefficient) of the final log result. The coefficient is updated at the same time in the 'Coefficient Update' module. The right-shift (RS) operation to be performed on A is also achieved at the same clock cycle by simply updating the position of the DP in the 'DP Update' module. The intermittent data is passed into the 'Power-10 Unit' and fed back to ZD for further processing. The data flow to ZD is controlled by 
A. Power-10 architecture
The Power-10 module is a key unit of the log converter and the accuracy of the final result greatly depends on its efficient implementation. Because of its complexity, we have explored several options based on "divide and conquer" algorithm to efficiently implement the unit which are shown in Fig 5. Considering the tradeoff between hardware cost and speed, we have chosen option 3 for our implementation. The algorithm is based on a recursive powering that requires one parallel multiplier unit and 4 cc to complete. This is a significant improvement over the previous implementation [22] which had taken 40 cc for such computation.
The overall architecture of the Power-10 unit is shown in Fig. 6(a) where the width of all data lines is 16-digit. It consists of a 16-digit combinational multiplier, an accumulator, and a few latches. The selection bit (Sel) dictates the multiplication operation: 0 for A*A; 1 for A*B. A key step of the proposed algorithm is to count the number of integers before the decimal point to evaluate the coefficients, i C , which may take any value between 0 and 9 (where, 0 i ). Inside the multiplier, the most significant 16 digits are retained and accumulated for further processing. The data flow is controlled using a 2:1 multiplexer by the controller (lines not shown). After the first multiplication stage, the output (i.e., X 2 ) is stored in a temporary latch so that it can be used later at the fourth multiplication stage. The timing diagram for each clock pulse and the selection/control sequence are shown in Fig  6(b) [here, n indicates the instance of the clock pulse at any given time, t].
B. 16-digit decimal multiplier
Several efficient methods for decimal multiplication have been proposed in the past [9] [10] . Here, we have used a general purpose 16-digit combinational multiplication algorithm which is a modified version of [9] . This is another improvement over the previous 32-bit design [22] where a sequential multiplier was used. The multiplier architecture (as shown in Fig. 7 ) is optimized to reach the desired throughput. The multiplier input is recoded and the partial products are first kept in a redundant format and then accumulated by a tree of redundant adders. Finally the 32-digit product is obtained by converting the carry-save tree's outputs into BCD format. The presented combinational architecture results in low latency and that is why it is chosen for the log converter.
The product, p is given below, where A and B are the signed multiplier and the multiplicand respectively: -one carry (1 bit) and one digit (4-bits). For a 16-digit input, it generates 16 digits and 16 carry bits. In the second step, the carry bits are added; the carry is not propagated to the next digit. The generation of 5X (five times input) is performed by first computing 10X (ten times input) and a simple division operation. The overall architecture of the partial product generator (PPG) is shown in Fig 8. The partial products are added using adder tree and converted to BCD. For this conversion, we have used an efficient signed-digit decimal adder [18] which has the benefit of carry-free addition; however a carrypropagation adder (CPA) must be used to transform the signed-digit sum into an unsigned sum. The operation is shown in Fig. 9 and described as follows by (12) 
V. PERFORMANCE EVALUATION
The architecture of the 64-bit log converter has been prototyped using Verilog and synthesized onto Xilinx Virtex2p FPGA (xc2vp30ff1152-7). The breakdown of the cost of different units is shown in Table I . It can be seen that the Power-10 unit (including a combinational decimal multiplier) consumes the most resources of the entire system.
A. Error analysis
For the targeted 64-bit decimal FP applications, the proposed log converter must be able to achieve the minimum accuracy (i.e., 32-bit binary precision as defined in [31] ) to guarantee correct operation. In order to compute the maximum error of the converted log result, we have performed an error analysis, where a long test vector comprising of 100 16-digit positive decimal numbers (ranging from 0000000000000000 to 9999999999999999 with arbitrary position of decimal point) is used. The error is computed taking a precision of 22-digit (written in Matlab) as reference. Fig. 10 shows the normalized error plot (in log scale) for only five arbitrary samples for different internal digit precision. In the x-axis, the number of decimal digits retained after one power-10 iteration is shown. It can be seen from the plot that the algorithm produces less error if higher number of digits is retained. The proposed architecture is based on a 64-bit binary precision, and thus can only handle up to 16-decimal digits. As a result, the error, as seen in the plot, is fixed after 16 digits. The maximum normalized error at this precision is estimated to be 14 3.53 10 . Table II compares the results of the proposed log converter with other similar designs. In the cases, where the information of logic cells is not available, we have computed it from the count of slices (e.g., one slice is equivalent to two logic cells). The latency indicates the minimum number of clock cycles required to produce decimal 1-digit output. In all cases, similar Xilinx FPGA technology is used and the maximum absolute error along with the number of digit accuracy is presented.
B. Hardware comparisons
We start with two binary designs [12] [13] which give us a rough estimate about the relative comparison between binary and radix-10 designs. Note that, the results presented in [12] and [13] are based on the synthesis of HDL code that was originally generated automatically by C++ program.
The work in [7] is based on a curve fitting (linear approximation) algorithm. Due to the use of look-uptable (i.e., ROM mapping), the design takes only 1 cc to produce the output, but the crude approximation algorithm results in large error (e.g. maximum absolute error is 0.09 with only 3 digit accuracy). There are 16 partition regions used to achieve such accuracy and a complex error-correction circuitry is required at the end. The work in [17] is based on a digital recurrence algorithm. With the use of large look-up-tables and complex mapping, the error is largely minimized, but at the expense of low operational frequency and reduced throughput (e.g. latency is 18 cc), which makes the scheme unsuitable for time critical applications. The decimal digit accuracy is 14, which is still lower than the proposed algorithm. [17] also discusses briefly the extension to 64-bit design, but the cost of actual FPGA implementation is not reported. As a result, we have estimated the cost from the 7-digit core. In [22] , the authors have presented an iterative scheme with a sequential multiplier unit that results in consuming relatively low recourses; however the sequential nature of the architecture and the un-pipelined operation limit the performance by yielding a very large latency and low throughput.
Compared to all existing designs, the proposed scheme has much lesser computation error and higher digit accuracy (very naturally as it uses higher precision), lesser hardware cost, and higher frequency of operation. Compared to [22] , the hardware cost of the proposed 64-bit scheme is higher (and so is the estimated power consumption) because of the two following reasons: (1) use of a combinational multiplier; (2) use of a much higher digit precision. However, the (minimum) latency is 4 cc which makes the proposed scheme very suitable for time and accuracy critical applications. The computation algorithm is generalized and scalable, which means that the architecture can be extended for decimal128 format without causing large increase in the complexity and hardware cost -this is another advantage of the proposed scheme. As an example, for compliance with the IEEE754-2008 decimal64 format, [17] requires a significant increase in two LUTs from 14-digit to 34-digit, and moderate increase in other processing blocks with a large increase in latency by at least two times.
VI. CONCLUSION
The paper presents a fast algorithm and efficient implementation for computing decimal logarithm using 64-bit floating-point arithmetic that complies with the IEEE754-2008 standard. The algorithm is based on a digit-by-digit iterative computation that does not require look-up tables, curve fitting, decimal-binary conversion, or division algorithms. The final logarithmic output is very accurate with a maximum absolute error of 3.53x10 -14 ; no correction or rounding circuitry is required that makes the scheme suitable for timing and accuracy critical applications. The architecture is generalized and scalable -can be extended for decimal128 format. Future research is directed towards such extension, as well as the VLSI implementation of the algorithm. 
