Abstract-This paper presents a Radix-100 divider based on the decimal non-restoring algorithm and the selection-by-truncation method. Two decimal quotient digits are selected in each iteration, which halves the number of iteration cycles. An initialization step scales the divisor into a pre-calculated range and generates the required multiples of the scaled divisor. Implemented with the STM 90-nm standard cell library, the proposed architecture takes 14 clock cycles (373 FO4) to reach the desired accuracy, a much shorter latency than that of Radix-10 dividers.
I. INTRODUCTION
Division, although the least frequently used, is the most complex and time-consuming of the four basic operations (addition, subtraction, multiplication and division). This motivates the design of high-speed division hardware algorithms to enhance the performance of arithmetic processors.
The processor industry has recently begun to support the decimal floating-point (FP) number system for human-centric applications such as financial, commercial and management computing. Moreover, this revitalized number system is included in the IEEE FP standard. The main reason for this extensive support is that decimal FP mirrors human computations, which cannot be performed precisely by binary FP.
Continuing the trend of hardware support for decimal FP, several decimal dividers have been designed [1], [2], [3], [6], [7]. One way to improve the performance of decimal division hardware is to use high-radix decimal division algorithms. This paper takes that approach and develops a radix-100 decimal divider to speed up the slow decimal division operation.
However, the well-known SRT-based division algorithms [1], [2], [6], [7] are unsuitable for Radix-100. The main reason is that the number of possible quotient digits in Radix-100 is very large, and SRT needs almost as many comparisons as there are quotient digits before making a decision. Even if binary high-radix division ideas are adopted by overlapping [4] or cascading [5] Radix-10 SRT dividers, the cycle time would still be far from satisfactory. Radix-100 division therefore needs a method other than the popular SRT.
A Radix-100 divider is worthwhile only if it has a good cycle time and saves a great deal of latency. A non-restoring method combined with selection by truncation [3] has gained our attention. Although its performance in decimal division is moderate, it has an inherent advantage: a simple quotient digit selection (QDS). By adopting that algorithm in a Radix-100 divider, a two-digit quotient can be selected at once with simple combinational logic. Although selection by truncation takes more time in the pre-scaling step than SRT, it needs almost half as many iteration cycles as Radix-10 dividers, which saves a great amount of time overall. In addition, to avoid the long delay of adding up the partial remainder that occurs in [3], carry-save adders (CSA) are used throughout the iteration.
This paper is organized as follows. Section II describes the Radix-100 algorithm based on pre-scaling and selection by truncation. Section III describes the architecture used to implement the proposed Radix-100 divider. Section IV presents the synthesis results and a comparison with other work. Section V concludes the paper.
II. ALGORITHM
A. Radix-100 Division Algorithm
In line with the IEEE-754R 64-bit decimal format, our Radix-100 algorithm operates on a 16-digit dividend and a 16-digit divisor and produces a quotient accurate to 16 digits.
The radix-100 digit-recurrence division algorithm is described by the recurrence
$w[i+1] = 100\,w[i] - q_{i+1}\,d''$ (1)
where $w[i]$, $d''$ and $q_i$ are the $i$-th partial remainder, the scaled divisor and the quotient digit, respectively. Moreover, we assume a normalized divisor and dividend ($1 \le d, x < 100$) and assign the maximally redundant digit set to the quotient (i.e. $q_i \in [-99, 99]$).
Applying the selection-by-truncation algorithm (i.e. $q_{i+1} = \mathrm{trunc}(100\,w[i])$) requires the divisor to be scaled to a value close to 1 so as to satisfy the convergence condition shown in (2), where the scaled divisor $d''$ is in the range $[1, 1 + \delta)$ [3].
The first and probably most awkward part is to determine the value of $\delta$ such that (2) holds. The critical case occurs when $d''$ and $q_{i+1}$ take their maximum values, which, according to (2), leads to (3).
Therefore, the divisor should be scaled to $1 \le d'' < 1.01$. Consequently, the dividend must be scaled by the same pre-scaling parameter so that the final quotient is correct.
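One way to see where the bound of 1.01 can come from is sketched below. It assumes the recurrence $w[i+1] = 100\,w[i] - q_{i+1}\,d''$ with truncation toward zero and $|q_{i+1}| \le 99$; the symbols $r$ and $\varepsilon$ are introduced here only for illustration, and the argument is not necessarily the form taken by (2) and (3).

```latex
\begin{align*}
r &= 100\,w[i] - q_{i+1}, & |r| &< 1, \quad r\,q_{i+1} \ge 0, \\
d'' &= 1 + \varepsilon, & 0 &\le \varepsilon < \delta, \\
w[i+1] &= r - q_{i+1}\,\varepsilon, & |w[i+1]| &< \max(1,\ 99\,\delta).
\end{align*}
```

Under these assumptions, keeping $|w[i+1]| < 1$, so that the next selected digit stays within $[-99, 99]$, needs $99\,\delta \le 1$, and $\delta = 0.01$ is a convenient decimal value that satisfies it.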
In addition, for every partial remainder to converge according to (2), the dividend $x$, which serves as $w[0]$, should be divided by 100, since
$0.01 \le w[0] = x / 100 < 1$ (4)
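The recurrence and the selection rule of this subsection can be tried out with exact rational arithmetic. The following is a minimal sketch, not the hardware algorithm: it assumes the recurrence $w[i+1] = 100\,w[i] - q_{i+1}\,d''$, takes an already pre-scaled divisor as input, and uses the hypothetical name `radix100_divide`. Carry-save arithmetic, compensation and rounding are not modeled.

```python
from fractions import Fraction
from math import trunc


def radix100_divide(x, d_scaled, iterations=9):
    """Run the radix-100 recurrence with selection by truncation.

    x        : dividend, assumed to satisfy 1 <= x < 100
    d_scaled : pre-scaled divisor d'', assumed to satisfy 1 <= d'' < 1.01
    Returns the signed radix-100 digits, the accumulated quotient
    (an approximation of x / (100 * d'')) and the final partial remainder.
    """
    d = Fraction(d_scaled)
    w = Fraction(x) / 100              # w[0]: dividend divided by 100
    quotient = Fraction(0)
    digits = []
    for i in range(1, iterations + 1):
        q = trunc(100 * w)             # selection by truncation
        w = 100 * w - q * d            # next partial remainder
        digits.append(q)
        quotient += Fraction(q, 100 ** i)
    return digits, quotient, w


digits, quotient, w = radix100_divide(37, Fraction(1005, 1000))
print(digits)
print(float(quotient), float(Fraction(37, 100) / Fraction(1005, 1000)))
```

Nine iterations correspond to the 18 quotient digits mentioned later in the paper. Because the arithmetic is exact, the identity $w[0] = d''\,Q + w[n]\,100^{-n}$ can be checked directly on the returned values.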
B. Prescaling Parameter
Direct scaling coefficients are stored in a look-up table (LUT), and a few most significant digits (MSDs) of the divisor are used as inputs to that table. The number of MSDs must satisfy (5), where $d'$ will be discussed later and can be treated as $d$ here. Since 3 is the smallest number of MSDs satisfying (5), the LUT is indexed by three digits.
In principle, 900 parameters, one for each three-MSD value from 10.0 to 99.9, would have to be stored in the table, which would be costly in both area and time. To reduce the table size, divisors within the regions specified in Table I are moved into the range [50, 100) by multiplying them by 5 or 2 with simple multiplication logic. This reduces the number of parameters to 600: 100 parameters lie between 20 and 30, while the other 500 fall into [50, 100).
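The range adjustment can be illustrated with a small sketch. Since Table I is not reproduced in this text, the region boundaries below are an assumption chosen only because they are consistent with the counts quoted above (100 parameters between 20 and 30, 500 in [50, 100)); the function names `adjust_divisor` and `lut_index` are hypothetical.

```python
from fractions import Fraction


def adjust_divisor(d):
    """Return (d_adjusted, factor) for a divisor with 10 <= d < 100.

    ASSUMED regions (Table I is not available here):
      [10, 20)  -> multiply by 5, result in [50, 100)
      [20, 30)  -> unchanged
      [30, 50)  -> multiply by 2, result in [60, 100)
      [50, 100) -> unchanged
    """
    d = Fraction(d)
    if not 10 <= d < 100:
        raise ValueError("divisor must satisfy 10 <= d < 100")
    if d < 20:
        factor = 5
    elif 30 <= d < 50:
        factor = 2
    else:
        factor = 1
    return factor * d, factor


def lut_index(d_adjusted):
    """Three most significant digits as an integer, e.g. 52.37 -> 523."""
    return int(d_adjusted * 10)


# Count the distinct table indices reachable from a fine grid of divisors.
indices = {lut_index(adjust_divisor(Fraction(k, 100))[0])
           for k in range(1000, 10000)}
print(len(indices))
```

Under the assumed regions, only indices 200 to 299 and 500 to 999 are reachable, which matches the 100 + 500 split described in the text.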
C. Initialization
In the initialization step, several values are calculated in advance so that they can be used during the iterations, which saves considerable latency.
• The pre-scaling parameter is selected.
• The divisor and dividend are pre-scaled using a decimal CSA tree.
• Multiples ($-1d''$ to $5d''$) of the pre-scaled divisor are generated for use in the iterations.
• The compensation values are calculated (introduced in Section D).
D. Quotient selection
The quotient digit of the Radix-100 divider can be represented as in (6), so the algorithm selects two decimal quotient digits in each iteration.
$q = 10\,q_H + q_L$ (6)
With the selection-by-truncation method, the quotient can be determined relatively easily compared to other methods. However, generating $q\,d''$ is time-consuming, since there are 19 possible values (-9 to 9) for each of $q_H$ and $q_L$. $1d''$, $2d''$, $4d''$ and $5d''$ can be calculated with simple and fast multiplication logic, and $3d''$ can be obtained by adding $1d''$ and $2d''$, so generating these multiples takes only 2 cycles in the initialization. We therefore use only the multiples $-5d''$ to $5d''$ during the iteration and apply a compensation method to ensure accuracy. Similar to [3], the basic idea of compensation is to transform a quotient digit in the ranges [-9, -6] and [6, 9] into the range [-5, 5] by adding or removing $10d''$ in advance. In the particular case of Radix-100, where the quotient $q$ is expressed as in (6), the compensation scheme requires the values $\pm 110d''$, $\pm 100d''$ and $\pm 10d''$. The compensation value is also selected by combinational logic, shortly after the two quotient digits are determined.
In [3], the partial remainder is added up by a CLA before the quotient digits are determined, which increases the cycle time considerably. Here, by contrast, only the two quotient digits and the sign digit, all in carry-save format, are examined by a combinational logic unit, while the other digits are added by a decimal prefix tree in parallel. Combined with the compensation scheme, this restricts the influence of the carry generated by the less significant digits to $q_L$ alone. Two candidates for $q_L$ are therefore generated before the carry is known: one assumes that the carry from the lower digits is 0, and the other includes the carry. The carry then selects the correct $q_L$ before the values enter the decimal carry-save adders. The sign digit is also checked in carry-save format. The partial remainder is considered negative if the sign digit is 8 or 9 with no carry passed from the quotient digits. Our analysis shows that the sign determination does not depend on the carry produced by the parallel decimal prefix tree: the only case in which that carry could influence the sign is 9.99 (the sign digit and both quotient digits equal to 9), which has the same effect as 0.00 in this algorithm.
III. ARCHITECTURE
The overall architecture of the proposed algorithm is illustrated in Figure 1. Details of the dashed circles are depicted in Figures 2 and 3. As shown in Figure 2, in cycle 1 the original 16-digit divisor passes through multiplication logic to generate $d'$ according to the process defined in Table I. The three most significant digits of $d'$ select a three-digit pre-scaling parameter, while at the same time $2d'$ and $5d'$ are generated before their negative values are calculated in 9's complement.
Adding 1 to a 9's complement value produces the desired 10's complement number; this is done in the carry-save adders shown at the bottom of Figure 2. Part of the generation of the pre-scaled divisor $d''$ is also done in the second cycle. Each digit of the pre-scaling parameter goes through a MUX to choose the proper multiple of $d'$. These multiples are then added by four levels of decimal carry-save adders, three of which finish in cycle 2. In cycle 2, the dividend reuses the architecture in Figure 1, except that the pre-scaling parameter is taken from an input instead of being selected from the parameter table. As with the pre-scaled divisor, the computation of the pre-scaled dividend finishes in cycle 3.
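The relation between the 9's and 10's complement used above can be sketched on plain digit lists. This is a software illustration with hypothetical function names; in the architecture the added 1 is absorbed by the carry-save adders rather than propagated as shown here.

```python
def nines_complement(digits):
    """Digit-wise 9's complement of a decimal number (MSD first)."""
    return [9 - x for x in digits]


def tens_complement(digits):
    """10's complement: 9's complement plus 1, modulo 10**len(digits)."""
    out = nines_complement(digits)
    carry = 1                          # the "+1" enters at the least significant digit
    for i in range(len(out) - 1, -1, -1):
        total = out[i] + carry
        out[i] = total % 10
        carry = total // 10
    return out                         # a carry out of the MSD is discarded


def to_int(digits):
    """Value of an MSD-first digit list."""
    return int("".join(str(x) for x in digits))


x = [0, 2, 5]
neg_x = tens_complement(x)
print(neg_x, (to_int(x) + to_int(neg_x)) % 10 ** len(x))
```

Adding a number to its 10's complement gives zero modulo $10^n$, which is why the complement can stand in for the negative multiple in the adder tree.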
The sum and carry of the pre-scaled divisor $d''$ are added by a decimal prefix tree in cycle 3. The "Multiples of Sum" produced by the block "Multiple" in Figure 1 stands for $1d''$, $2d''$, $5d''$, $4d''$ and the 9's complement of $-1d''$.
In cycle 4, the same decimal prefix tree is used to add $2d''$ and $1d''$, so that $3d''$ is generated and can participate in the iteration directly. Meanwhile, $110d''$ and $-110d''$ are generated in parallel by two levels of carry-save adders (through the rightmost CSA). The other compensation values can be obtained by shifting $1d''$ or $-1d''$ during the iteration.
The main architecture is shown in Figure 3, where all iteration cycles are performed. In the first iteration cycle, the scaled dividend in carry-save format is selected; in later cycles, the partial remainders from the previous iteration are selected as PartialS and PartialC. The "digit recognition" unit can "guess" the value of each quotient digit and the sign digit, so that three quotient digits ($q_H$ and two candidate $q_L$ values) and the polarity can be estimated without considering the carry from the less significant digits. While digit recognition is in progress, the 9's complement negative values of the divisor multiples are generated by the "Negative" unit in Figure 3. The 1 that must be added to those 9's complement numbers is handled later.
The results of the digit recognition go directly into two MUXs. The candidates of those MUXs are the multiples of the scaled divisor, which are generated in the initialization step and in the "Negative" block. Simultaneously, the compensation value is selected by the "Compensation" unit in the form of a sum and a carry. The carry of the least significant digits is generated at almost the same time as $q_{L1} d''$ and $q_{L2} d''$ are selected. Therefore, the correct $q_L d''$ is chosen before entering the addition step. At this point, there are 4 sums and 2 carries ready to be added: $q_H d''$, $q_L d''$, the compensation sum CS and the compensation carry CC, together with the carry-save format PartialS and PartialC. In addition, if $q_H d''$ or $q_L d''$ is negative, 1 or 2 "ones" must be added as well. That is done in the "9's complement transformation" unit depicted in Figure 3.
The additions are performed in three levels of carry-save adders. The first level contains 2 carry-save adders, while the "ones" are "added" by a MUX. In the second level, another MUX is used to add the "Sum" of the previous MUX and the other carry. Finally, one CSA produces the partial remainder, Out_PartialS and Out_PartialC.
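A single level of decimal carry-save addition can be sketched as follows. The encoding is an assumption, since the text does not spell it out: each position holds one decimal sum digit and one carry bit, and a third decimal operand is absorbed without carry propagation. Lists are least significant digit first, the names are hypothetical, and the MUX-based handling of the "ones" is not modeled.

```python
def decimal_csa(sum_digits, carry_bits, operand):
    """Add a decimal operand to a carry-save pair (LSD-first lists).

    Assumed format: sum_digits[i] in 0..9, carry_bits[i] in 0..1.
    The result represents the same total modulo 10**n.
    """
    n = len(sum_digits)
    if not (len(carry_bits) == n and len(operand) == n):
        raise ValueError("all inputs must have the same length")
    new_sum = [0] * n
    new_carry = [0] * n
    for i in range(n):
        total = sum_digits[i] + carry_bits[i] + operand[i]   # at most 19
        new_sum[i] = total % 10
        if i + 1 < n:
            new_carry[i + 1] = total // 10   # carry moves one position up
    return new_sum, new_carry


def cs_value(sum_digits, carry_bits):
    """Integer value represented by a carry-save pair."""
    return sum((s + c) * 10 ** i
               for i, (s, c) in enumerate(zip(sum_digits, carry_bits)))


s, c = decimal_csa([9, 9, 0], [1, 0, 0], [5, 1, 0])
print(s, c, cs_value(s, c))
```

Each position is computed independently of its neighbours, which is what keeps the delay of one level constant regardless of the operand width.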
The rounding unit, which is based on the on-the-fly architecture, processes the quotient digits together with the sign digit and the carry digit. After all iterations are finished, an additional cycle is needed for the final rounding. Overall, 4 cycles are used for initialization and 9 cycles for iteration. Rounding the 18 quotient digits to 16 digits takes another cycle. The algorithm therefore needs 14 cycles in total.
IV. IMPLEMENTATION AND COMPARISON
The proposed 16-digit Radix-100 divider, as described in the last section, is modeled in Verilog and implemented with Synopsys Design Compiler using the STM 90-nm CMOS standard cell library under typical conditions. The critical path lies in the iteration architecture shown in Figure 3.
The cycle time of our implementation is 1.2 ns (including register delay), and the number of cycles is 14. Consequently, the overall latency is 16.8 ns, which equals 373.3 FO4 for the 90-nm standard cell library.
Since [3] is a Radix-10 implementation of a similar algorithm, its implementation results are comparable. It targets a POWER6 environment at 5 GHz with a 13-FO4 cycle time and takes 82 cycles to reach 16-digit accuracy. Its latency is 1066 FO4, almost 3 times ours. Apart from the CLA used in [3] to calculate the partial remainder, the selection-by-truncation algorithm works better in Radix-100 because, at the cost of some initialization time, the number of iterations is greatly reduced, which results in a lower latency. Tomas Lang's work [1] is among the best in the field of SRT-based Radix-10 division, and it is synthesized in the same library as our work. It reaches a latency of 20 ns with a 1 ns cycle, which is 444 FO4. Our Radix-100 divider is 16% faster than that design.
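The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the FO4 delay is not given directly and is derived here from the reported 16.8 ns and 373.3 FO4, and the variable names are illustrative.

```python
# Proposed divider: 14 cycles at 1.2 ns per cycle.
cycles = 14
cycle_time_ns = 1.2
latency_ns = cycles * cycle_time_ns            # about 16.8 ns

# FO4 delay implied by the reported latency figures (derived, not stated).
fo4_ns = latency_ns / 373.3
proposed_fo4 = latency_ns / fo4_ns

# Radix-10 selection by truncation [3]: 82 cycles at 13 FO4 per cycle.
radix10_truncation_fo4 = 82 * 13

# Radix-10 SRT [1]: 20 ns in the same library.
radix10_srt_fo4 = 20.0 / fo4_ns

ratio_vs_3 = radix10_truncation_fo4 / proposed_fo4
gain_vs_1 = (radix10_srt_fo4 - proposed_fo4) / radix10_srt_fo4

print(round(latency_ns, 1), round(fo4_ns * 1000, 1))
print(radix10_truncation_fo4, round(radix10_srt_fo4), round(ratio_vs_3, 2))
print(round(100 * gain_vs_1))
```

The 16% figure does not depend on the FO4 value, since both designs use the same library: it is simply $1 - 16.8/20$.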
The comparison above and some other implementation results can be found in Table II. All results are expressed in FO4, which is fairer than timing data obtained in different synthesis environments. The data in the table show that our Radix-100 divider is much faster than the Radix-10 dividers while maintaining 16-digit precision. A notable feature of Radix-100 follows from the column showing the number of cycles for any decimal standard. Since the number of iteration cycles of Radix-100 is half that of most Radix-10 algorithms, a larger precision means that more cycles are saved. Therefore, the Radix-100 divider is far beyond the reach of Radix-10 dividers if Decimal128 (I=34) is required.
V. CONCLUSION
Due to the increasing demand for fast and precise division, a Radix-100 division algorithm based on a combination of non-restoring division and selection by truncation is presented and implemented in this paper. The main advantage of this algorithm over the more common SRT-based methods is its simple and quick QDS scheme. Two quotient digits can be determined at the same time. Therefore, a great number of iteration cycles are saved, resulting in a small latency.
A table of scaling parameters is built for pre-scaling. Some techniques help to reduce the number of parameters from 900 to 180. Multiples of the scaled divisor are generated in the initialization part, which saves iteration time significantly. The initialization takes a relatively large number of cycles, but still far fewer than the cycles saved by the Radix-100 method. To further limit the number of multiples required in the iteration, a compensation method is implemented. By adding or removing specified compensation values, the multiples are restricted to the range from -5 to 5, which are relatively easy to calculate in the initialization.
The ultimate goal of the proposed divider is a small latency. The proposed algorithm is synthesized with the STM 90-nm standard cell library. A correct 16-digit quotient is produced after 14 cycles, and the latency is 16% lower than that of [1] and much lower than that of the other Radix-10 dividers.
