Abstract. Ancient mathematical formulae can be directly applied to the optimization of algebraic computation. A new algorithm for computing the decimals of the inverse, based on such ancient mathematics, is reported in this paper. The Sahayaks (auxiliary fraction) sutra has been used for the hardware implementation of the decimals of the inverse. On account of the ancient formulae, reciprocal approximation of numbers can generate "on the fly" the first n exact decimals of the inverse, with n being either arbitrarily large or at least 6 in almost all cases. The reported algorithm has been implemented, and its functionality has been checked in T-Spice. Performance parameters, such as propagation delay and dynamic switching power consumption, are calculated through spice-spectre of 90 nm CMOS technology. The propagation delay of the resulting 4-digit reciprocal approximation algorithm was only 1.8 µs, and it consumed 24.7 mW of power. The implementation methodology offered a substantial reduction of propagation delay and dynamic switching power consumption compared with its Newton-Raphson (NR) based counterpart.
Introduction
Nowadays, decimal computation plays a pivotal role in human-centric areas such as financial and internet-based applications in which exact results are expected. Consequently, hardware implementation in Application-Specific Integrated Circuits (ASICs) has gained popularity during the last decade [1] [2] [3] [4] [5]. Generally, hardware implementation of computer arithmetic circuits is based on the binary number system because its operations are simpler than those of the decimal number system [6]. However, many decimal numbers cannot be represented exactly in binary format due to the finite word-length effect [6]; hence, an exact representation in binary format is impractical. Recently, decimal arithmetic has been presented for general-purpose computers [7] with the help of Binary Coded Decimal (BCD) encoding techniques.
Reciprocal approximation plays a pivotal role in several applications such as digital signal and image processing, computer graphics, scientific computing, etc. [8]. Moreover, the division operation can be computed with the help of reciprocal approximation in the following manner: the reciprocal of the divisor is computed first, and then it is used as the multiplier in a subsequent multiplication with the dividend [9]. This method is particularly economical when the dividend varies while the divisor stays the same. Nowadays, reciprocal approximation methods are typically based on the Newton-Raphson method [10]. However, due to its poor performance (high computation time), division is used less frequently than the other two basic arithmetic operations, addition and multiplication [11].
A substantial number of algorithms and their hardware implementations have been reported so far [8] [9] [10] [11] [12]. All of these algorithms are based on either Taylor series [12] or iterative techniques (Newton-Raphson [9, 10] or Goldschmidt [11]). These basic algorithms have been combined, and extensive work has been carried out and reported by researchers [8] [9] [10] [11]. The mentioned algorithms have long latencies or large area overhead due to their linear convergence rate, so a huge number of operations is required for computation. Moreover, the iterative methods start with an initial approximation of the reciprocal of the divisor, usually implemented through a look-up table; a large ROM is therefore required to accommodate the denominator, leading to higher delay and power. The achievable precision of the reciprocal unit is thus limited by the ROM size, as the size of the look-up table increases with the required precision.
At the algorithmic and structural levels, a significant number of algorithms and hardware implementation methodologies have been developed to reduce the propagation delay (latency), mostly by reducing the number of iterations, but the principle behind these algorithms is the same. In this paper, a reciprocal algorithm and its hardware architecture based on ancient mathematics are addressed. Sahayaks (auxiliary fraction), a Sanskrit term from the Vedas, is employed to realize the reciprocal circuitry. This paper extends a previous paper by the same authors and two others [13], in which the inversion algorithm was introduced for the first time, and discusses the circuit implementation of a division unit that uses this algorithm. With the help of the ancient methodology, the reciprocal algorithm has been realized by an algebraic transformation of the digits into smaller ones, and the overall division has been carried out on the transformed digits; circuit complexity is thereby reduced substantially, which in turn reduces propagation delay. To carry out the transistor-level implementation of the decimal reciprocal unit, optimized 4221 BCD recoding techniques [14, 15] have been adopted in this study. The reciprocal unit is fully optimized in terms of calculations, so any input configuration can be handled. The transistor-level implementation of the reciprocal circuitry has been carried out by combining BCD arithmetic with ancient mathematics. Performance parameters, such as propagation delay and dynamic switching power consumption of the reported method, have been calculated by spice/spectre models in 90 nm CMOS technology and compared with other designs such as the Newton-Raphson (NR) [7] based implementation. The calculated results revealed that the 4-digit reciprocal unit has a propagation delay of only 1.8 µs with 24.7 mW dynamic switching power consumption.
Ancient methodology for reciprocal computation
The contributions of ancient Indian mathematics to the field of mathematical science are not well recognized. Ancient books offered mathematical operations that can be carried out mentally to produce fast answers using the sutras. In this paper, the sahayaks (auxiliary fraction) sutra is presented for implementing the reciprocal algorithm.
Examples
To fully understand the algorithm, take the example of $1/(a9)$, i.e., a two-digit denominator ending in 9, where $a = 1, 2, \ldots, 9$. In the conversion of such an irregular fraction into a recurring decimal, the ekadhika (by one more than the previous) process can be used. Assume $a = 5$; thus, we want to calculate the value of $1/59$. Hence, ekadhika purva (one more than the previous) is $5 + 1 = 6$. The method of division is described in Figure 1, and the chart implementation procedure is described in Table 1. One more example is given in Appendix A.
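The ekadhika procedure for denominators of the form a9 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `ekadhika_trace` and the row layout (dividend, quotient, remainder) are assumptions chosen to mirror the steps described for 1/59.

```python
def ekadhika_trace(a, n_steps):
    """Trace the expansion of 1/(10*a + 9) using the divisor a + 1.

    Returns one (dividend, quotient, remainder) row per step. The quotient
    column, read top to bottom, gives the decimals of the reciprocal.
    """
    divisor = a + 1            # ekadhika purva: one more than the previous digit
    dividend = 1               # the numerator of the fraction
    rows = []
    for _ in range(n_steps):
        q, r = divmod(dividend, divisor)
        rows.append((dividend, q, r))
        # Next dividend: remainder concatenated with the quotient digit.
        dividend = 10 * r + q
    return rows


rows = ekadhika_trace(5, 6)                  # 1/59 with divisor 6
decimals = "".join(str(q) for _, q, _ in rows)
print(rows)
print("0." + decimals)
```

For a = 5 the quotient column reads 0, 1, 6, 9, 4, 9, which agrees with 1/59 = 0.016949..., and the rows include the divisions of 41, 56, and 54 by 6.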
Algebraic proof of sutra
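Let $x = \frac{q_0}{10} + \frac{q_1}{100} + \frac{q_2}{1000} + \frac{q_3}{10000} + \cdots$ be the unknown inverse of a number $N$.
If $N = 59$, we are going to verify that 6 is the appropriate divisor in the 'avalanche' of Euclidean divisions, $a = bq + r$, where the new dividend is obtained by concatenating the previous remainder, $r_k$, with the previous quotient, $q_k$ (or a function $f(q_k)$ of the previous quotient). With divisor 6 and $f(q_k) = q_k$, the chain of divisions reads:
$$\begin{aligned} r_0 &= 6q_0 + r_0 \\ 10r_0 + q_0 &= 6q_1 + r_1 \\ 10r_1 + q_1 &= 6q_2 + r_2 \\ 10r_2 + q_2 &= 6q_3 + r_3 \\ &\;\;\vdots \end{aligned}$$
with $q_0 = 0$. Dividing these equalities by 10, 100, 1000, etc., respectively, and adding them, the remainder terms cancel and one obtains:
$$\frac{r_0}{10} + \frac{x}{10} = 6x \iff 59x = r_0.$$
This equality proves that if we take $r_0 = 1$, $x$ is the desired inverse of 59 with the $q_k$ as its decimals, as long as all these $q_k$ are less than 10, which is the case here.
Table 1 (surviving rows; computation of 1/59 with divisor 6):
Step 3: Dividing 41 by 6 (6 times, remainder 5) gives quotient 6 and remainder 5.
Step 4: Dividing 56 by 6 (9 times, remainder 2) gives quotient 9 and remainder 2.
Step 6: Dividing 54 by 6 (9 times, no remainder) gives quotient 9 and remainder 0.
Surviving rows of a second table, in which the quotient is subtracted from 9 and then concatenated with the remainder:
Subtracting the quotient from 9 and concatenating it with the remainder gives 9. Dividing 9 by 6 (1 time, remainder 3) gives quotient 1 and remainder 3.
Dividing 38 by 6 (6 times, remainder 2) gives quotient 6 and remainder 2.
Dividing 23 by 6 (3 times, remainder 5) gives quotient 3 and remainder 5.
Dividing 56 by 6 (9 times, remainder 2) gives quotient 9 and remainder 2.
Dividing 20 by 6 (3 times, remainder 2) gives quotient 3 and remainder 2.
The general proof is along the same lines with: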
$$\begin{aligned} r_0 &= dq_0 + r_0 \\ 10r_0 + f(q_0) &= dq_1 + r_1 \\ 10r_1 + f(q_1) &= dq_2 + r_2 \\ 10r_2 + f(q_2) &= dq_3 + r_3 \\ 10r_3 + f(q_3) &= dq_4 + r_4 \\ &\;\;\text{etc.} \end{aligned}$$
(here also, $q_0 = 0$). The function $f(\cdot)$ is to be replaced by one of the appropriate rules; for example, if $f(q_k) = 2q_k$, one obtains the following by the same operation as before (division by 10, 100, 1000, etc., and then adding):
$$\frac{r_0}{10} + \frac{2x}{10} = dx \iff (10d - 2)x = r_0.$$
Taking $r_0 = 2$, one has $(5d - 1)x = 1$; thus, we have obtained a recipe for getting all the decimals of the inverse of integers of the form $5d - 1$. Here, $q_k < 10$ is required.
The ancient rule for recipe implementation is given in Appendix B.
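The general chain of divisions can be checked numerically with a short Python sketch. It assumes the rule $f(q_k) = k\,q_k$ for a small integer multiplier $k$, for which summing the chain gives $(10d - k)x = r_0$; the names `reciprocal_digits` and `exact_digits` are illustrative and do not come from the paper.

```python
def reciprocal_digits(d, r0, k, n_digits):
    """Digits q_0, q_1, ... from the chain 10*r_j + k*q_j = d*q_(j+1) + r_(j+1).

    With f(q) = k*q, the sum of q_j / 10**(j+1) equals r0 / (10*d - k),
    provided every quotient stays below 10.
    """
    digits = [0]               # q_0 = 0, from r_0 = d*q_0 + r_0
    r = r0
    for _ in range(n_digits - 1):
        q, r = divmod(10 * r + k * digits[-1], d)
        if q >= 10:
            raise ValueError("quotient is not a single digit; rule not applicable")
        digits.append(q)
    return digits


def exact_digits(n, n_digits):
    """Reference decimals of 1/n obtained by plain integer long division."""
    return [(10 ** (j + 1) // n) % 10 for j in range(n_digits)]


# f(q) = q, d = 6, r0 = 1 gives 1/59; f(q) = 2q, d = 39, r0 = 2 gives 1/194.
print(reciprocal_digits(6, 1, 1, 12), exact_digits(59, 12))
print(reciprocal_digits(39, 2, 2, 16), exact_digits(194, 16))
```

The explicit check on `q >= 10` reflects the condition stated in the proof: the digits are exact only while every quotient remains a single decimal digit.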
Implementation of reciprocal algorithm
Pseudo-code for the implementation of the reciprocal algorithm is given hereunder. As seen in the pseudo-code, the last digit of the denominator is obtained through a mod-10 operation. If the last digit is 2, 4, or 6, the denominator is multiplied by 5; if the last digit is 5, it is multiplied by 2; if the last digit is 3, it is multiplied by 4. If the last digit is 0, the right-shift operation is performed directly. If the last digit is 1, 7, 8, or 9, the algorithm is applied directly. The multiplication is carried out to reduce the digits of the denominator. The algorithm continues until 16 digits after the decimal point have been generated.
The proposed reciprocal algorithm technique, shown in Figure 2 at block level, is used to implement the hardware architecture. First, the input numbers (i.e., dividend) are taken and forwarded to the divisor adjustment unit. The divisor adjustment unit consists of comparator, adder, subtractor, and multiplier blocks. Once the digits have been reduced, the output of the divisor adjustment unit is passed on to the division circuitry. The reciprocal calculation algorithm is implemented in this way.
Divisor adjustment unit
The divisor adjustment unit is shown in Figure 3. Its input is the divisor, and the last digit of the divisor is propagated to the comparator. When the last digit of the divisor is 9, the last digit is ignored and the previous digit is incremented by one. When the last digit of the divisor is 8, the digit is ignored and a signal is sent to the new dividend generation unit for the next iteration. Likewise, when the last digit is 3, the numerator and denominator are multiplied by 4. When the last digit is 5, the numerator and denominator are multiplied by 2; the last digit of the denominator is then ignored and the division is performed.
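The last-digit rules stated for the pseudo-code and the divisor adjustment unit can be transcribed as a single dispatch step. The sketch below is a literal reading of those rules for one pass only; the function name `adjust_once` and the action labels are assumptions, and the text does not specify how repeated passes are sequenced, so no loop is attempted.

```python
def adjust_once(num, den):
    """Apply one last-digit rule to the fraction num/den.

    Returns (num, den, action). 'scale' keeps the value of the fraction
    unchanged; 'shift' drops a trailing zero of the denominator, so the
    caller must move the decimal point of the result one place to the left;
    'direct' means the digit-by-digit algorithm is applied as is.
    """
    last = den % 10                      # last digit via a mod-10 operation
    if last == 0:
        return num, den // 10, "shift"
    if last in (2, 4, 6):
        return num * 5, den * 5, "scale"
    if last == 5:
        return num * 2, den * 2, "scale"
    if last == 3:
        return num * 4, den * 4, "scale"
    return num, den, "direct"            # last digit 1, 7, 8, or 9


print(adjust_once(1, 25))   # multiply numerator and denominator by 2
print(adjust_once(2, 50))   # then drop the trailing zero
print(adjust_once(1, 59))   # handled directly
```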
Division
In this section, the divider and its hardware implementation algorithm, designed to increase the speed of the operation, are described. The inputs of the algorithm are the divisor (dvs) and the dividend (dvd), and the outputs are the quotient and the remainder; 'n' is the number of digits in the dividend and 'i' is the iteration count. The flow chart of the algorithm is shown in Figure 4, and an example is given in Appendix C. The divider implementation technique is shown in Figure 5. For simplicity, the dividend is assumed to be longer than the divisor. First, the digits of the divisor and dividend are taken from the Most Significant Digit (MSD) side. If the MSD of the dividend is greater than the MSD of the divisor, the dividend MSD is divided by the divisor MSD; otherwise, the two most significant digits of the dividend are taken and divided by the divisor MSD. The division generates a quotient digit and a remainder. The remainder is concatenated with the next most significant digit of the dividend, and the product of the quotient digit and the least significant digit of the divisor is subtracted from the result. If the result is negative, the quotient digit is reduced by 1 and set as the new quotient digit; otherwise, the result is passed on to the next stage. The division algorithm proceeds in this way.
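The digit-serial division described above can be sketched in Python for a two-digit divisor. This is an illustrative software model under stated assumptions, not the hardware design: the name `flag_divide` is hypothetical, all dividend digits are processed uniformly (leading quotient digits are simply 0 instead of taking two digits at once), and the "reduce the quotient by 1" correction is applied by adding the divisor back to the temporary dividend.

```python
def flag_divide(dividend, divisor):
    """Divide by a two-digit divisor 10*m + f, dividing only by its MSD m."""
    assert 10 <= divisor <= 99 and dividend >= 0
    m, f = divmod(divisor, 10)            # m: actual divisor, f: remaining digit
    *body, last = [int(c) for c in str(dividend)]
    digits = [0]                          # quotient digits found so far
    r = 0
    for y in body:
        # Remainder concatenated with the next digit, minus f * previous quotient.
        t = 10 * r + y - f * digits[-1]
        while t < 0:                      # negative: lower the previous quotient digit
            digits[-1] -= 1
            t += divisor
        q, r = divmod(t, m)
        digits.append(q)
    # The last dividend digit only contributes to the remainder.
    rem = 10 * r + last - f * digits[-1]
    while rem < 0:
        digits[-1] -= 1
        rem += divisor
    quotient = 0
    for q in digits:                      # digits above 9 carry automatically
        quotient = 10 * quotient + q
    return quotient, rem


print(flag_divide(49873, 83))
```

Here 49873 is the dividend implied by the Appendix C stages (quotient 600, remainder 73, divisor 83); the intermediate temporary dividends are 49, 00, 07, and 73.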
BCD computation
In this paper, we consider the optimized 4221 coding technique for decimal digit representation. As we have mentioned, the use of BCD-8421 to represent decimal digits is expensive because decimal corrections in the partial product reduction binary CSA tree are required to obtain the correct decimal carry and sum. 
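The contrast between the two codes can be illustrated with a small Python check. It only uses the positional weights (4, 2, 2, 1) and (8, 4, 2, 1); the specific optimized digit-to-pattern assignment used in [14, 15] is not reproduced here, and the helper names are assumptions.

```python
from itertools import product


def decode(bits, weights):
    """Value of a 4-bit pattern under the given positional weights."""
    return sum(b * w for b, w in zip(bits, weights))


W4221 = (4, 2, 2, 1)
W8421 = (8, 4, 2, 1)
patterns = list(product((0, 1), repeat=4))

# Every 4221 pattern decodes to a decimal digit, so sums of such patterns
# never land on an invalid code word.
values_4221 = sorted({decode(p, W4221) for p in patterns})

# Six of the sixteen 8421 patterns (10..15) are not decimal digits,
# which is why decimal corrections are needed in a binary CSA tree.
invalid_8421 = [p for p in patterns if decode(p, W8421) > 9]

# Because the weights sum to 9, inverting all bits yields the 9's complement.
complement_ok = all(
    decode(tuple(1 - b for b in p), W4221) == 9 - decode(p, W4221)
    for p in patterns
)

print(values_4221, len(invalid_8421), complement_ok)
```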
Results and discussions
The functionality of the proposed algorithm is examined using the spice-spectre simulator. Transistor-level simulation was performed through the spice-spectre simulator in 90 nm CMOS technology with a 1 V node voltage, operated at 250 MHz. First, all the circuit modules built from full-custom cells are simulated, and finally the complete architecture that combines all the modules is simulated, so that the decoding performance can be considered and reflected in the results. The simulations have been conducted for all possible bit combinations. The performances shown in the result section are the worst-case scenarios, in which delay and consumed power are maximized over the bit combinations. More examples for the calculation of the reciprocal of numbers are given in Appendix C. The reported methodology revealed that the application of ancient mathematics to reciprocal implementation reduces the number of iterations. Hardware usage therefore decreases and, as a result, propagation delay and dynamic switching power consumption decrease.
Error analysis
The computational error can be defined as:
$$E_r = \frac{\text{exact value} - \text{assumed value}}{\text{exact value}} \times 100.$$
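As a worked instance of this definition, the following Python sketch evaluates the percentage error of a reciprocal truncated to a fixed number of decimals. Exact rational arithmetic is used so that the result is not affected by floating-point rounding; the function names are illustrative assumptions.

```python
from fractions import Fraction


def percent_error(exact, assumed):
    """E_r = (exact value - assumed value) / exact value * 100."""
    return (exact - assumed) / exact * 100


def truncated_reciprocal(n, n_digits):
    """1/n truncated to n_digits decimals, as an exact fraction."""
    return Fraction(10 ** n_digits // n, 10 ** n_digits)


# 1/59 truncated to four decimals is 0.0169.
err = percent_error(Fraction(1, 59), truncated_reciprocal(59, 4))
print(err, float(err))
```

For the 4-digit approximation of 1/59 the error is 29/100, i.e., 0.29%.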
For comparison, the reciprocal of the divisor, $1/N$, is calculated using the Newton-Raphson iterative method. The first iteration uses an initial seed, herein obtained using a piecewise linear approximation based on minimax polynomials. The method converges quadratically, that is, the error of the approximation is roughly squared at each iteration. Calculation of the error in the proposed algorithm is described here:
Here, the error is:
Percentage of error is equal to:
Computational error, $E_r$, can be minimized if a correction term is added to or subtracted from the approximated result. The comparison chart based on the proposed methodology is given in Figure 7, which compares the errors of the proposed methodology with those of the N-R methodology. The computations have been programmed in MATLAB using the IEEE single-precision format, and the value has been calculated for different numbers of digits. The error graph shows the approximate and average errors. We have taken the N-R algorithm from [10] and computed it in the same environment. The number of exact decimals provided by the algorithm is shown in Figure 8.
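The behaviour of the Newton-Raphson baseline can be illustrated with the standard reciprocal iteration $x_{i+1} = x_i(2 - Nx_i)$. The sketch below is not the implementation of [10]: it assumes the divisor has been scaled into [0.5, 1] and uses a single linear seed, $48/17 - (32/17)N$, in place of the piecewise linear minimax seed mentioned in the text.

```python
def newton_raphson_reciprocal(n, iterations):
    """Iterates of x <- x * (2 - n * x) for 0.5 <= n <= 1, seed included."""
    x = 48 / 17 - 32 / 17 * n        # assumed single-segment linear seed
    iterates = [x]
    for _ in range(iterations):
        x = x * (2 - n * x)
        iterates.append(x)
    return iterates


n = 0.59                             # 59 scaled into [0.5, 1]
# Relative error in percent: (1/n - x) / (1/n) * 100 = (1 - n * x) * 100.
errors = [abs(1 - n * x) * 100 for x in newton_raphson_reciprocal(n, 4)]
for i, e in enumerate(errors):
    print(i, e)
```

Each iteration roughly squares the relative error, so the iteration count, rather than the digit count, determines the precision; every iteration costs two multiplications and one subtraction.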
Comparison
The performance parameters, such as propagation delay and dynamic switching power consumption, are shown in Table 2 as a function of the number of input digits. Input data are taken over the possible digit combinations for experimental purposes. Our main focus has been on reducing the performance parameters, namely propagation delay and dynamic switching power consumption, and thereby the energy-delay product.
Modifications at the device, circuit, and architectural levels of the design hierarchy have been analyzed for reducing propagation delay and average dynamic power consumption. The delay and power of the different architectures are measured. Pass-transistor/Transmission Gates (TG) are used in the design of the different modules for faster operation and better logic transformation. The basic difference between pass-transistor logic and the CMOS logic style is that the source side of the logic transistor networks is connected to some input signals instead of the power lines. The advantage is that one pass-transistor network (either NMOS or PMOS) is sufficient to perform the logic operation, which results in a smaller number of transistors and smaller input loads, and hence in high speed and lower power consumption [16]. For each transition, the delay is measured from 50% of the input voltage swing to 50% of the output voltage swing. The propagation delay and the power consumption have been measured under the worst-case input pattern and at the output where the delay is maximized. Each circuit module has been simulated individually, and finally the complete circuit has been simulated with the same approach. For comparison purposes, the architectures have been taken from different references [9, 10] and implemented in the same technological environment.
A comparison between the different architectures in terms of propagation delay and dynamic switching power consumption is also shown in Table 2. Simulation results for the 4-digit reciprocal of a number showed a 25% speed improvement compared with the N-R iteration-based [10] architecture. Moreover, the improvement in terms of switching power is 9% in the same environment.
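The energy-related figures of merit mentioned above follow directly from the reported 4-digit results. The sketch below uses only the reported delay (1.8 µs) and dynamic switching power (24.7 mW); treating the product of the two as the energy per operation is an assumption of this illustration, and the variable names are hypothetical.

```python
delay_s = 1.8e-6          # reported propagation delay of the 4-digit unit
power_w = 24.7e-3         # reported dynamic switching power

# Power-delay product: energy drawn during one reciprocal computation.
energy_j = power_w * delay_s

# Energy-delay product: penalizes designs that save energy by being slow.
edp_js = energy_j * delay_s

print(f"energy per operation: {energy_j * 1e9:.2f} nJ")
print(f"energy-delay product: {edp_js:.4e} J*s")
```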
Conclusions
In this paper, a new algorithm for the computation of the decimals of the inverse based on ancient mathematics is reported. By employing this ancient methodology, the decimal reciprocal has been implemented by transforming the digits into smaller ones, and the division has been carried out on the smaller (transformed) digits. The transformation reduced circuit-level complexity, which led to a substantial reduction in propagation delay. The functionality of these circuits has been checked, and the performance parameters, such as propagation delay and dynamic power consumption, have been calculated through spice-spectre of standard 90 nm CMOS technology. Simulation results for the 4-digit reciprocal of a number showed a 25% reduction in propagation delay compared with the N-R iteration-based [10] architecture, whereas the corresponding improvement in switching power is 9% in the same environment.
Surviving rows of the appendix tables:
Dividing (r × 10 + q × 2) = 300 by 39 (7 times, remainder 27) gives quotient 7 and remainder 27.
Step 6: Dividing (r × 10 + q) = 586 by 79 (7 times, remainder 33) gives quotient 7 and remainder 33.
Dividing (r × 10 + q × 2) = 284 by 39 (7 times, remainder 11) gives quotient 7 and remainder 11.
Appendix C
The example of the divider algorithm is shown in Figure C.1, and the chart implementation procedure is shown in Table C.1.
1) One digit of the divisor, i.e., '8' (MSD), is written slightly lower than the rest of the divisor digits (i.e., '3'); it is the actual divisor for the subsequent division process. Since the 1st digit of the dividend ('4') is less than the divisor ('8'), two digits of the dividend, i.e., '49', are taken as the temporary dividend. After division, we get '6' as the 1st quotient digit and '1' as the remainder at the completion of the 1st stage.
2) In the 2nd stage, the temporary dividend (i.e., '00') is generated by concatenating the remainder of the 1st stage (i.e., '1') with the next unused digit of the actual dividend (i.e., '8') and subtracting from it the product of the rest of the divisor digits, '3', and the quotient digit, '6', of the last stage. After division, we get '0' as the 2nd quotient digit and '00' as the remainder at the completion of the 2nd stage.
3) In the 3rd stage, the temporary dividend (i.e., '07') is generated as in the 2nd stage, and we get '0' as the 3rd quotient digit and '07' as the remainder at the completion of this stage.
4) In the 4th stage, the temporary dividend (i.e., '73') is generated as in the previous cases. In this stage, however, no division is performed and the procedure stops, since the stopping criterion is met (i.e., the last digit of the actual dividend has been used). Thus, we get '600' as the quotient and the temporary dividend, '73', as the remainder.
