I. INTRODUCTION

P
ERFORMING manual calculations using decimal arithmetic is part of human nature. Typical computers, on the other hand, support binary arithmetic more readily. The ENIAC, which became operational in 1945 at the University of Pennsylvania, was one of the early attempts to use radix 10 calculations in digital computers [1] . The IBM eServer z900 seems to be the only recent processor capable of performing decimal instructions in hardware [2] , [3] . However, its decimal computation capability is limited to integers operands. Recently, decimal arithmetic has become more attractive in the financial and commercial world including banking, tax calculation, currency conversion, insurance, and accounting. The following facts may explain this recent interest.
• A survey of commercial databases [4] shows that 98.6% of the numbers stored are decimal or integer while more than half of them are represented in pure decimal format.
• It is well understood that when converting between decimal and binary formats, most fractional decimal numbers are only approximately represented in binary floating-point (FP) representation and, therefore, may loose precision [5] , [6] . This means that using binary FP numbers in financial applications, which cannot tolerate errors, does not necessarily guarantee correct results.
• Regulations such as that for the European Commission Directorate General II [7] specify decimal digits for currency calculations. The importance of decimal arithmetic has led to a proposed revision to the IEEE 754 standard for FP arithmetic to include specifications for decimal arithmetic [8] . This means that although computers are still carrying out decimal FP (DFP) calculations using software libraries [9] , [10] and binary FP numbers, it is likely that in the near future, many high-end processors will perform decimal operations directly on DFP operands using dedicated DFP units, which are hundreds of times faster than the software packages [11] .
In this paper, a DFP division algorithm and its implementation are proposed. The algorithm is based on well-known highradix SRT 1 division [12] - [15] . The division's partial remainder is represented in a new redundant format called decimal signeddigit (DSD) and the SRT recurrence is carried out using decimal carry-free (DCF) addition [16] . Unlike conventional dividers, which use the potential difference (PD) plot or the selection constants methods for quotient digit selection [17] , the new divider is designed using comparison multiples [18] .
There are three novel issues presented in this paper; employing the SRT division algorithm, which was originally developed for radices of power of 2, to implement a divider with a radix that is not presented as a power of 2, removing the PD plot (the lookup table) from the conventional implementation of the SRT algorithm and replacing it with a set of comparators followed by sign detectors, and introducing a divider that can perform decimal division faster than any available counterpart.
It should be noted that the design is optimized for speed at the expense of power and area. This is reasonable for one of the first DFP dividers in the literature; however, low-power architectures remain an open topic. This paper is organized as follows. High-radix SRT division is discussed in Section II. Section III presents previous related works in decimal division. In Section IV, the fundamentals of DSD arithmetic are discussed. Specification of the DFP format introduced in the proposal of the IEEE 754R standard [8] is briefly explained in Section V followed by a short review of the rounding issues in DFP arithmetic. Section VI describes how high-radix SRT division can be used for implementing DFP division. Section VII gives a timing evaluation of the implementation. This paper presents a conclusion in Section VIII.
II. HIGH-RADIX SRT DIVISION
Surveys [14] and [19] show that many VLSI implementations of FP division are based on the SRT digit recurrence division algorithms. SRT division is an iterative algorithm with linear convergence toward the quotient. In this algorithm, the quotient digit selection (QDS) function calculates a fixed number of bits of the quotient every iteration. The speed of a SRT divider is mainly determined by the complexity of the QDS function [20] , which is traditionally implemented using the lookup table method. In this method, the QDS function is realized in a table implemented with a programable logic array (PLA) or a ROM [17] .
Radix-SRT division compliant with the IEEE 754 standard [21] is implemented using the recurrence where (1) plus one more cycle to produce the quotient in the IEEE 754 standard. In the recurrence (1), the dividend and the divisor are two normalized binary numbers in the range and is the number of iterations needed to produce the bits of the precision. Also, represents the quotient digit in the signeddigit (SD) redundant format [22] selected from the redundant radix-digit set (2) where and . In (1), the next partial remainder (PR),
, is represented in a redundant format [22] . To select the correct , the QDS function applies the most significant bits of and to a lookup table. The resulting quotient digit should be such that is always bounded as (3) where is known as the redundancy factor. The inequality (3) is known as the convergence condition. To satisfy the convergence condition in the first iteration, the first PR, , is initialized to . The quotient represented in SD is converted to radix-nonredundant form using the on-the-fly conversion and rounding algorithm [23] .
III. HISTORY OF DECIMAL DIVIDERS
Yabe et al. [24] present an implementation of a digit recurrence decimal division. Every iteration, limited digit numbers from the normalized divisor and the PR are applied to a prediction table to find a value for the quotient digit. If the new PR is negative, then the divisor is added to the PR (restoration step) and the incorrectly predicted quotient digit, which is represented in a redundant form, is adjusted in the next iteration. This restoring decimal division requires a full-length decimal addition per iteration. Moreover, if restoration is needed, a full-length decimal subtraction must be performed as well. Therefore, due to these two time consuming operations, the divider is very slow.
The division presented by Busaba et al. [3] is a simple restoring division, which like the approach of Yabe et al. [24] , subtracts the divisor from the PR each iteration until the PR becomes negative. The algorithm restores the PR and obtains the corresponding quotient digit. This method, which is used in the IBM z900 decimal arithmetic unit, again uses a very slow full-length BCD addition/subtraction per iteration.
Yamaoka et al. [25] develop a nonrestoring decimal divider, which subtracts the divisor from the dividend every iteration and accumulates the number of iterations in the quotient register. It suffers from a full-length decimal addition/subtraction every iteration since the PR is represented in the conventional binarycoded decimal (BCD) format. Also, as another disadvantage, the number of iterations is unknown until a negative PR is found.
A nonheuristic decimal divider presented by Ferguson [26] , determines one quotient digit every iteration. It limits the number of candidates for the quotient digit to two by normalizing both the divisor and the dividend. The normalization is performed by multiplying the dividend and the divisor by a common factor. This adds a large delay overhead to the operation. To select the correct quotient digit from the candidates, a subtraction must be performed. Using the quotient digit, the corresponding multiple of the normalized divisor is selected from a table. The multiples must be precomputed and stored in the table. This operation requires parallel full-length multiplications (or successive full-length additions/subtractions), which not only increase the decimal division execution time massively but also increase the implementation area. The divider adjusts the final quotient if the divisor is normalized at the beginning of division. This adjustment involves doubling, and several times incrementing or decrementing the quotient. The delay penalty caused by the adjustment is also rather high.
Wang and Schulte [27] claim the first DFP divider complying with the proposal of the IEEE 754R standard. The divider, which uses the Newton-Raphson iteration [14] , doubles the number of quotient digits every iteration. This feature makes it very fast compared to the previously discussed decimal dividers. This improvement is achieved by using an optimized piecewise linear approximation, a modified Newton-Raphson iteration, a specialized rounding technique, and a simplified combined decimal incrementer/decrementer in the design. However, a Newton-Raphson iteration requires two multiplications every iteration. This has a negative impact on the latency of decimal division since a pipelined decimal multiplier is difficult to build and requires much more delay than decimal addition.
IV. DSD ARITHMETIC
Among decimal arithmetic operations, there are some complex functions such as sequential multiplication and digit recurrence division, which compute and accumulate partial results. These repetitive operations cannot be accomplished in a reasonable time without fast circuits for decimal addition. One method for implementing fast decimal adders is to take advantage of carry-free addition.
Nikmehr [18] introduces an 8-bit decimal signed-digit arithmetic for representing redundant decimal digit of the form , where . In DSD arithmetic, a DSD number expressed as is an -digit array with DSD in a maximally redundant set (4) Although a DSD can be represented in 5 bits in 2's complement format [28] , in this paper, the 8-bit format is used because it is a natural extension to the traditional binary signed-digit (BSD) format. Therefore, it is quite possible that this representation makes building circuits capable of performing both binary and decimal division more feasible. In addition, as explained later, while an 8-bit DSD number can be negated using an array of inverters, this needs a 2's complement operation in the 5-bit DSD representation.
The DSD is represented by a BSD vector as , where and 0,1,2,3. Hence, each DSD is encoded as 8 bits. The value of can be determined as , where the operator " " refers to decimal subtraction. A BCD to DSD format conversion can be performed in zero time with no use of hardware. However, converting from DSD to BCD, like all conversions from redundant to nonredundant formats, requires a time consuming operation.
A DSD number can be negated as (5) where " " denotes a 1's complement (invert) function. For example, while number 9 can be represented in DSD as (1001,0000), using (5), can be expressed as either or since in BSD arithmetic , 0 either (1,1) or (0,0) and (0,1). A new DCF addition is presented in [16] . This addition, like its binary predecessors [29] - [31] , limits carry propagation to a small number of digit positions to the left and, therefore, all digitwise addition operations, irrespective of their length, execute in the same time. The adder can be easily simplified if the addends are either two BCD numbers or one BCD and one DSD number.
An interesting feature of DSD arithmetic is that subtraction , where and are two -digit BCD numbers, can be performed with no hardware time delay as , where is in the DSD format. It should be noted that in the remainder of the paper, the word "digit" is frequently used. Where this is not further qualified, based on the context, "digit" refers to a decimal digit, represented in either DSD or BCD.
V. DFP IN IEEE 754R STANDARD PROPOSAL
In the proposed revision to the IEEE 754 standard, the encodings for decimal numbers allow for a range of positive and negative values together with values of , , and not-a-number (NaN) [32] . Three formats of decimal numbers are allowed as follows:
• decimal32 numbers, which are encoded in four consecutive bytes (32 bits); • decimal64 numbers, which encoded in eight consecutive bytes (64 bits); • decimal128 numbers, which are encoded in 16 consecutive bytes (128 bits). A finite DFP number is defined by a sign, an exponent, and a decimal integer coefficient. The value of the DFP number is given by coefficient . In this representation, sign is a single bit (as in IEEE 754 standard for binary FP), exponent is encoded as an unsigned binary integer from which a is subtracted, and coefficient is an unsigned decimal integer. The three DFP formats, decimal32, decimal64, and decimal128, have 7-, 16-, and 34-B coefficients, respectively. The representation format proposed for coefficient uses densely packed decimal encoding [33] . It is a compressed form of the traditional binary-coded decimal (BCD) format. This encoding is a lossless algorithm, which compresses three BCD digits into 10 bits. The algorithm can be applied or reversed using only simple Boolean operations.
Unlike the binary FP representation in which the significand is a normalized number, coefficient has no such limitation. This means that it can take any value between 0 and , where is the length of coefficient in decimal digits.
A. Precision
Precision , is a positive integer, which sets the maximum number of significant digits that can result from an arithmetic operation. The upper limit to it can be the length of the coefficient supported by the decimal representation format, i.e., 7, 16, or 34 digits corresponding to decimal32, decimal64, and decimal128, respectively. There is a lower limit on the setting , which may be the same as the upper limit.
B. Rounding
According to the proposal of the IEEE 754R standard [8] , a final quotient that might be represented in more digits than the standard requires should be rounded to fit the destination format. Among different rounding modes introduced by the proposal, round to nearest even (RNE) is the mode used in the decimal division developed in this paper since it rounds with zero bias 2 [22] .
VI. DFP DIVISION USING SRT ALGORITHM
This section introduces DFP division using the high-radix SRT algorithm, i.e., 10. The proposed method fulfills the requirements of the proposal of the IEEE 754R standard [8] .
Assumptions
• The radix is equal to 10.
• It is known that in high-radix SRT division, to perform the recurrence using a carry-free adder (CFA) [22] in radix , the SD set from which the quotient digit is selected must satisfy [22] . Therefore, for DFP division,
, where is the largest allowed value for the quotient digit. Investigation shows that almost the same propagation delay occurs when generating either set of the multiples of used in the SRT recurrence of DFP division: or . So, for the DFP division described here (6) which is the maximally redundant set corresponding to 1. This choice makes the QDS function simpler and, therefore, reduces its delay.
A. DFP Division Formulation
Considering as the PR and as the precision, the DFP division definition (for nonzero and nonspecial operands) 3 can be expressed as follows.
• The unpacked (BCD) representations of the dividend coefficient, , and the divisor coefficient, , are written in the fraction form as and , respectively. • With the appropriate number of left shifts, the fractions are normalized such that and (7) where and represent normalized and , respectively. Consequently, the dividend exponent, , and the divisor exponent, , should be modified accordingly. Using the new values, the quotient exponent is set as (8) • The decimal QDS function selects the correct value for the quotient digit from DSD set (6) so that the convergence condition (9) is always satisfied.
• The decimal SRT recurrence (10) where is used to generate the next PR. Since the inequality cannot always be guaranteed, in order to satisfy the convergence condition (9), the PR is initialized as (11) The PR is represented in the DSD format and, therefore, (10) can be carried out using DCF addition [16] .
• Using the quotient digits generated in every recurrence, is formed as (12) after cycles. The additional digit is used later for rounding.
• The value of is set as XOR (13)
B. Convert and Round
In the DFP divider, RNE is considered as the default rounding mode. As in binary dividers, after the th division iteration, 3 This simplifying assumption prevents any exception handling. 
Unlike binary rounding, the decimal rounding may encounter a halfway condition [34] . Examples include (15) To deal with this, the final PR is zero detected in the th cycle as
Then, the rounded is determined using the values formed by the on-the-fly rounding algorithm [23] in the , , and registers, where 1 refers to 1 unit of the last position (ulp), and the rules shown in Table I . It should be noted that the final is not post-normalized because the decimal representation in the proposal of the IEEE 754R standard is a nonnormalized format.
C. Dealing With Exact Results
An issue unique to DFP division is to represent exact quotients correctly. Based on the proposal of the IEEE 754R standard, if the final quotient is exact then decimal division should return the coefficient and the exponent of the correct value. The exponent has to be the closest value to the ideal exponent. The ideal exponent is the original dividend exponent minus the original divisor exponent. In our implementation of DFP division, , which is calculated every iteration as if if (17) signals whether the quotient is exact. Once the th PR is found equal to zero, one can conclude that all the consequent quotient digits are zero and, therefore, the division is exact. In this case, can be adjusted by an appropriate number of right or left shifts. Correspondingly, using the value of , obtained from (8) should be adjusted. The circuit performing this operation can be embedded in the CR unit.
D. Decimal QDS Function
This section explains how the comparison multiples method [18] has been used to develop the QDS function of the DFP divider. In this method, the quotient digit's magnitude is obtained by comparison of the truncated PR with truncated multiples of . Meanwhile, another circuit determines the sign of the quotient digit by checking only a few most significant digits of the PR.
Having considered the assumption given in Section VI-A and the normalizations (7), substituting (1) Table II always have some overlaps, where there are two choices for the . Therefore, to make the QDS function one-to-one, every two conjunct intervals are detached using separating points, namely the comparison multiples. The comparison multiple is chosen such that (20) where . The radix-nonredundant number is represented as , where is a (23) followed by 18 sign detections and a coder to be implemented.
E. Optimized QDS Function
The two major ideas to speed up the QDS function, and consequently, the recurrence cycle time, are as follows:
1) breaking the critical path into two or more concurrent, but shorter paths; 2) decreasing the fan-out of the circuits delivering to the QDS function. From (20) and Table II , a symmetry among the margins can be found. Therefore, having used the substitutions (24) . The number of comparisons in (25) decreases to 9; however, (25) cannot be fulfilled unless the sign of the shifted PR is known. Checking the sign of , while represented in the DSD format, needs a full-length carry generation. This causes an impracticable response time proportional to of the operand width. Investigation in Section VI-H shows that existence of the overlaps helps to convert the full-length sign detection to a limited-range. In other words, to make the decision whether or should participate in the comparisons, it is not necessary to know the exact sign of , but the sign of truncated to fractional digits. So, (25) changes to (26) , shown at the bottom of the page. Fig. 1 shows the general structure of the QDS function (26) . The design's subunits are explained later. Fig. 2 shows an implementation for the recurrence of DFP division using the QDS function shown in Fig. 1 . For simplicity, the CR unit is not shown. This design follows the general structure of the high-radix SRT recurrence [15] , [34] ; however, due to the special features of the proposed QDS function, minor changes are applied. In the following, the subunits involved in the DFP recurrence are briefly introduced.
F. DFP Recurrence
1) PR Formation:
In Fig. 2 , the PR formation unit calculates all the possible candidates for the next PR, . They are named corresponding to the values obtained from , where
. These values are kept in registers and in the next iteration, the correct value for the PR is selected using the magnitude of the quotient digit and MUX 11:1.
Considering 1-digit normalizations (7) and the fact that has one integer digit, the PR formation can be constructed using a -digit DCF adder [16] with a DSD addend and a BCD augend. Representation overflow may increase the length of to . A small adjust circuit can be developed based on the following observation to cancel the overflow every iteration [18] . Since the convergence condition (9) is always true, regardless of the exact value of , the 3 most significant digits of the next PR can take one of the combinations , , , , , and , where and .
2) Factor Generator:
The factor generator used in the DFP divider provides the multiples and the 9's complemented multiples of not only in full range, but also in truncated form to supply the comparison multiple generator. Using the following scheme, the appropriate multiples of are produced in full-range in the BCD format as quickly as possible.
1) Having produced in the BCD format just by a wired left shift, value is calculated in the DSD format as . Meanwhile, is produced in the DSD form as . 2) Then, using the values obtained, , , and are calculated in the DSD format as , , and . 3) Afterward, , , and are generated and represented in DSD as , , and . 4) The multiples of represented in the DSD format are reformatted into the BCD form. In this scheme, subtractions with BCD minuend and subtrahend are performed as explained in Section IV. The other addition and subtraction operations can be carried out using DCF addition [16] . DSD to BCD conversion can be implemented by appropriately modifying any BSD to 2's complement conversion approaches [35] , [36] with parallel-prefix carry generating networks [37] .
Due to the complex and time consuming operation performed by the factor generator, it needs an interval equal to one recurrence cycle time to generate and in the BCD format. With this initializing cycle at the beginning of DFP division (i.e., before the first recurrence iteration starts), DFP division produces the final quotient in cycles. 4 
G. Precisions of QDS Function Operands 1) Determining and :
To determine , which denotes the number of fractional digits of (as well as the number of fractional bits of ) involved in the DCF additions, the comparison intervals shown in (26) are studied. For simplicity, only the general interval shown as is investigated. However, the other cases can be derived in the same way. The interval is divided into
It is known that if BCD number is truncated to digits of precision right of the decimal point, then (28) However, for DSD number , the same truncation results in (29) Using (28) and (29), (27a) changes to or simply (30) Since , adding to (30) and using (1) results in . To maintain the convergence condition (3), the inequality or equivalently (31) must be complied with. For interval (27b), a similar derivation can be used to obtain (32) Replacing with in (32) and combining with (31) gives (33) This inequality gives tighter ranges than condition (20) . This is because rather than full-range, truncated comparison multiples are used in the comparisons. To make sure that finding a value for using (33) is always possible, the inequality should always be maintained. This leads to . Since , it follows that or (34) which gives the lower bound on for the comparison multiplesbased QDS function used in DFP division. In addition to , it is required to determine , which is the width of the integer section of the operands involved in the comparisons (23) . Having considered as the largest comparison multiple and , inequality (20) results in . This means that is a BCD number with at most one integer digit. On the other hand, Fig. 2 indicates that the comparators receive , which has one digit in its integer section. These two observations lead to 1. It should be noted that the representation overflow [38] may make the inputs to the comparison sign detectors one digit wider.
2) Determining and : In Fig. 1 , the PR sign detector finds the polarity of represented as a -digit SD number. Since the more digits the PR sign detector checks, the larger its response time is, it is useful to derive the lower bounds on and .
From (29), it can be derived that or equivalently . To make this inequality comply with the convergence condition (3) in the neighborhood of 0 with 0, the inequality must be satisfied. Since , this inequality can be changed to a tighter condition independent to as or (35) To find , one can see that since in Fig. 2 the registers store numbers with no integer digits, (the output of MUX 11:1), can have just one integer digit. Therefore, when is applied to the PR formation, to , which are the candidates for the next PR , will have at most two integer digits. This means that for , which is used by the PR sign detector in Fig. 1,  3 .
H. QDS Function Implementation
Having assumed 4 and 4, the main modules involved in the implementation are explained as follows. 
In Fig. 1 , each comparator " " can be realized using a DCF adder introduced by Nikmehr et al. [16] simplified for a DSD addend and a BCD augend.
2) Limited-Range Comparison Sign Detectors: Unlike BCD numbers, the sign of a DSD number is equal to the sign of its most significant nonzero digit. It means that a DSD sign detection may require an investigation over the length of the input operand. Each comparator in the QDS function is followed by a five-digit sign detector shown as "Comp Sign " in Fig. 1 . To find the signs of the DSD results obtained from (37) , there are nine sign detectors in the system. They can be implemented using networks similar to those used for precalculating the input carries in either parallel-prefix adders [37] or reverse-carry adders [39] . Therefore, having considered the fact that the input operands are very short, a comparison sign detector does not incur large delay. The sign bits are later used to form the magnitude of the quotient digit.
3) Limited-Range PR Sign Detector: The comparisons (37) cannot be accomplished unless the sign of is already known. Therefore, to prevent any additional delay to the iteration response time, the QDS function should be implemented in such a way that the sign of becomes available before (37) starts. Since 0, the sign of is already known before the first iteration begins. Therefore, selecting a value for using (26) is independent of PR sign detection in the first iteration. This observation can be extended to the th iteration to find the polarity of . The PR sign detector is based on (38) . This operation is performed in parallel to the rest of the QDS function and, therefore, does not affect the iteration delay.
The four-digit PR sign detector, which is shown as "PR Sign Det" in Fig. 1 , is implemented in the same way as the comparison sign detectors; however, as shown in Fig. 2 , when the QDS function is used in the DFP recurrence, it is duplicated in ten copies and put after the PR formation. According to this structure, the PR sign detectors find the polarity of , for . These signs are stored in the registers and then in the next iteration, one of them is selected as the correct sign using the magnitude of the quotient digit and MUX 10:1.
4) Coder:
In the proposed QDS function for DFP division, the DSD quotient digit is represented in the sign-magnitude format using and . In this representation don't care if otherwise (39) determines the sign and indicates the magnitude (absolute) of the value selected for . In other words,
Based on the value of , the comparison sign detectors produce nine bits, namely to . These bits, along with , are used by the coder to construct . The values represented by to as well as their relationship with the exact value of are shown in (42) for is required. For a given , (42) is performed as a full length BCD subtraction followed by a twodigit truncation over the result and a one-digit right shift. As shown, the full length subtraction stage incurs an intolerable delay. In the following, it is shown that to obtain for , the calculation (43) can be used instead. If (43) is to be an appropriate replacement for (42) , it must satisfy the convergence condition (9) in the neighborhood of . For simplicity, the investigation through (26) is limited to the positive quotient digits; however, inspecting the negative region gives identical results. Having used the truncation conditions (28) and (29), inequality , which corresponds to , can be expressed as or equivalently (44) To satisfy the convergence condition (3), (44) becomes or correspondingly (45) This is always correct since . Repeating the inspection in the right neighborhood of , where and , gives the same result. This means that (43) is a correct replacement for (42) .
In the comparison multiples generator, all the division (multiplication) by ten operations are performed simply by one BCD wired right (left) shift, in zero time. Also, since the minuends and subtrahends are in the BCD format, as discussed in Section IV, the subtraction in (43) is carried out in zero time as well. However, since the final needs to be in the DSD format, only once during the initializing cycle, DSD to BCD conversion is required after subtraction (43) . The conversion can be carried out using techniques, which are very similar to the ones developed for BSD to 2's complement conversion [35] , [40] - [42] . These techniques have a delay of , where is the width of the DSD operand in digits. The comparison multiple generator fulfills its tasks in the initializing cycle of DFP division (i.e., first cycle).
VII. DFP DIVISION EVALUATION
In this section, the latency and the area of the proposed DFP divider is estimated using a prelayout logical synthesis (Synopsys Design Compiler with Artisan 0.18-m typical standardcell library). It should be noted that the stages of unpacking the input operands and packing the quotient are not considered in the synthesis and, therefore, their delays do not participate in the estimation of the DFP division latency.
Investigation shows that the critical path delay of the design depicted in Fig. 2 is about 2.33 ns. This comprises the delays of 0.23 ns in the register (including the setup and the hold times), 0.85 ns in the multiplexors, and 1.25 ns in the . Assuming the dividend and the divisor to be two decimal128 numbers with 34, then since the proposed DFP divider calculates the final quotient in 37 cycles, the division execution time is 86.2 ns.
The prelayout synthesis estimates an area of 48100 2-input NAND gates for the new comparison multiple based DFP divider. To the best of the authors' knowledge, this is only the second hardware DFP divider to be reported in the literature and the first to use the SRT algorithm. There are, therefore, few points of reference against, which to evaluate the divider; however, to help place it in the context it is interesting to note that one of the latest IEEE 754 compliant radix-4 FP dividers [43] occupies 4780 2-input NAND gates and has a latency of 72.6 ns (estimated using a 0.35-m technology).
The first IEEE 754R compliant DFP divider published is due to Wang and Schulte [27] . It is based on the Newton-Raphson algorithm [14] , which is different from the proposed DFP divider in nature. However, since both dividers accept the same inputs and generate the same output (in terms of the size and representation standard), it is reasonable to compare their latencies.
An estimated critical path delay of 0.69 ns is reported by Wang and Schulte [27] . The timing evaluation is obtained from a synthesis using Synopsys Design Compiler and LSI Logic 0.11-m gflx-p standard cell library, under nominal operating conditions and a supply voltage of 1.2 V.
Wang and Schulte use a sequential fixed-point decimal multiplier [28] in the heart of the divider. It is able to produce between one and four digits per cycle. With a four-digit per cycle multiplier and for decimal128 dividend/divisor, Wang and Schulte report a DFP division latency of 113 cycles, which is equivalent to 77.97 ns. When they simulate the divider under more realistic conditions, i.e., an initial reciprocal approximation lookup table with a reasonable size and a one-digit per cycle multiplier, the Newton-Raphson-based DFP divider requires 246 cycles or 169.74 ns to produce the quotient.
The technologies used for timing evaluation in this paper and in [27] are different. Therefore, the absolute delays themselves could not be utilized for comparison. In order to make the delays comparable, the delays must be represented in FO4 [44] . Nedovic et al. [45] report 1 45 ps measured by Fujitsu Laboratories for the 0.11-m/1.2-V CMOS technology. For the 0.18-m technology, 1 65 ps. Having used these two numbers, the latency of the comparison multiple-based DFP divider is calculated as 1332 FO4, while the fastest and the slowest circuits reported by Wang and Schulte have the latencies of 1733 and 3772 FO4, respectively. Although these latencies include the delays of unpacking the input operands, handling the other rounding modes, and packing the result, the figures still show that the design proposed in this paper is faster than the latest DFP dividers reported in the literature.
VIII. CONCLUSION
Due to increasing demand for decimal arithmetic in financial and banking applications, developing circuits capable of performing arithmetic directly on decimal operands has become an important research topic. While addition, subtraction, and multiplication are much simpler to implement, implementing a divider accepting decimal dividends and divisors in the recently introduced IEEE 754R standard proposal is a great challenge.
A new DFP divider has been introduced. This uses DSD arithmetic, a new type of redundant decimal arithmetic. It also makes use of the comparison multiplies method for quotient digit selection. This method, which has been described both mathematically and as a VLSI architecture, calculates quotient digits in the sign-magnitude format, instead of searching for them in a lookup table. The QDS function receives a truncated PR and investigates the range to which the PR belongs. It performs the examination by comparing the truncated PR with the truncated multiples of the divisor produced once at the beginning of division. The result of the comparisons are delivered to a coder in order to produce the magnitude of the quotient digit. Meanwhile, another part of the QDS function, called the PR sign detector, calculates the polarity of the quotient digit by inspecting the sign of the truncated PR. In the comparison multiples-based QDS function, the PR sign detector operates off the critical path because the quotient digit sign and magnitude are calculated separately.
A logic synthesis with Artisan 0.18-m typical standard-cell library indicates that the new DFP divider calculates the quotient's coefficient in 86.2 ns and occupies an implementation area equal to 48100 two-input NAND gates.
