Advances in embedded systems, VLSI and processor design have increased the demands placed on systems in terms of power, speed, area, throughput, etc. Most sophisticated embedded applications are built around processors, which now need an arithmetic unit able to execute complex division operations with maximum efficiency. Hence the speed of the arithmetic unit depends critically on the division operation. Most dividers use the SRT division algorithm. In IoT and other embedded applications, radix 2 and radix 4 division algorithms are typically used. The proposed algorithm relies on parallel execution of various steps to shorten the critical path, and uses fuzzy logic to solve the overlap problem in quotient selection, thereby reducing the maximum delay and increasing accuracy. Every logic circuit has a maximum delay on which its timing depends, and the path causing that maximum delay is known as the critical path. Our approach builds on previous SRT methods to make a highly parallel pipelined design and uses the Mamdani model to resolve the overlap problem, reducing the overall execution time of radix 4 SRT division on 64-bit double precision floating point numbers to 281 ns. The design is written in Bluespec System Verilog, synthesized and simulated using Vivado v2016.1, and implemented on a Xilinx Virtex UltraScale FPGA board.
I. INTRODUCTION
For a long time, the chip industry depended on personal computers (PCs) and their applications. The heart of the PC consists of various devices that follow Moore's Law. The miniaturization of device geometry has changed the landscape of computing platforms. The PC's influence on IC design is slowly being eroded by the growth of embedded systems and of mobile and portable devices. Embedded systems now pervade every aspect of our lives through portable handsets, mobile phones, tablets, smart IoT-based devices, etc. [1].
The advent of ubiquitous computing brings with it various challenges: at the technology level, there are several unresolved technical issues concerning the design and implementation of computing architectures that enable dynamic configuration of ubiquitous services on a large scale [2].
Portability, power efficiency, computational heterogeneity, throughput and speed are the main parameters and areas of concern in the design of embedded systems. Significant changes in architecture and algorithms are required to meet these design constraints and requirements. Ironically, even after fifty years of intensive research that has altered almost every technology in integrated circuit design, the fundamental arithmetic operations and algebraic structures used in prevalent embedded systems are still based on the conventional arithmetic units used in personal computers. There is a dire need for increased parallelism and modularity in the operations of an arithmetic unit. Further, the most complex operation in the arithmetic unit is division, which takes the most time and the largest number of cycles. Division is vital and is supported in almost every microprocessor architecture. Although it occurs rarely, performance deteriorates considerably whenever a division does occur. Traditionally, division in most floating point arithmetic units is implemented by either Goldschmidt's or the Newton-Raphson algorithm, or by the SRT division method [2].
The Newton-Raphson method is multiplicative: it reuses the FPU multiplier and requires almost no additional hardware, hence increasing throughput through reduced latency. The SRT algorithm, in contrast, is subtractive and uses separate hardware for subtraction and shifting in each cycle. Although this increases the hardware and latency, parallelism is achieved, increasing speed and computational efficiency. With the growth in transistors per chip predicted by Moore's law and the accompanying fall in their price, subtractive methods are now widely implemented as standalone hardware; in [3] [4], derivatives of the SRT algorithm give better performance, throughput and computational efficiency. The famous Intel Pentium bug was also due to a few incorrect entries in the quotient selection table (QST) of the SRT radix 4 division algorithm [5].
SRT division is a recursive method that produces a predicted quotient digit in redundant form in every cycle. The speed depends on the number of cycles required for the computation to finish, so the speed increases as the radix increases. However, the complexity of quotient digit selection also increases, and consequently so does the size of the quotient selection table.
This paper presents a novel approach to the design and implementation of a highly modularized and parallel SRT radix 4 division algorithm in which the quotient digit is predicted based on the dividend and then corrected using fuzzy logic. This reduces the size of the QST drastically and hence performs better than traditional SRT radix 4 dividers with or without a prescaled divisor and dividend. The divider is further interfaced with a 64-bit RISC processor, simulated and synthesized in Vivado, and tested on a Xilinx Virtex UltraScale board. Section 2 of this paper explains the basic SRT algorithm and its conventional methodology. Section 3 explains the quotient prediction and correction algorithm. Section 4 explains the on-the-fly algorithm and Section 5 gives the results obtained, followed by a conclusion.
II. SRT DIVISION ALGORITHM
The core division algorithm is a trial-and-error process: a quotient digit is guessed and a subtraction follows; if the remainder is greater than the divisor, the predicted quotient digit is incorrect and the step is repeated, discarding the previous prediction. As noted earlier, division is the most difficult basic operation to implement in a computer arithmetic unit in terms of complexity, time and hardware.
The SRT division algorithm for implementing division in an arithmetic unit was proposed independently in the late 1950s by D. Sweeney, J. E. Robertson [6] and K. D. Tocher [7]. The use of redundant representations for the remainders was later introduced by D. E. Atkins [4].
SRT division is a recurrent algorithm that produces a fixed number of quotient bits, forming one quotient digit in redundant form, in every cycle. For the SRT division scheme the recursive relationship is defined by [8]:
$$P_{j+1} = r P_j - q_{j+1} D \qquad (1)$$
where the symbols are defined as follows:
$j$: the recursive index, from $0$ to $n-1$
$P_j$: the partial remainder used in the $j$th cycle
$P_{j+1}$: the partial remainder used in the $(j+1)$th cycle
$P_0$: the dividend
$P_n$: the remainder
$q_j$: the $j$th quotient digit
$n$: the number of digits, radix $r$, in the quotient
$D$: the divisor
$r$: the radix
For $k$-bit division with $r = 2^b$, it takes $\lceil k/b \rceil$ iterations to compute the final quotient and remainder. The division process needs a quotient selection table (QST) to determine the quotient digit and the remaining components; Eq. 1 is partially used to calculate them. Higher radix SRT division is implemented such that each quotient digit is selected from the digit set $\{-u, \ldots, -1, 0, 1, \ldots, u\}$, where $u$ is an integer such that $\frac{1}{2}(r-1) \le u \le r-1$ and $r$ is the radix.
The number of bits produced per quotient selection step determines the radix: predicting $b$ quotient bits in each step gives a radix-$2^b$ division method. The speed of the operation depends directly on the total number of cycles required for the computation to finish, so the speed increases with the radix. However, as the radix increases, the complexity of quotient digit selection also increases, and so does the size of the quotient selection table (QST). In this work the quotient digit is predicted based on the dividend and fuzzy logic, and is then corrected, if the initial guess was incorrect, using non-redundant digits of the dividend. This reduces the size of the QST drastically and hence performs better than traditional SRT radix 4 dividers with or without a prescaled divisor and dividend.
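The standard SRT recurrence, P_{j+1} = r*P_j - q_{j+1}*D, can be illustrated with a minimal Python sketch for radix 4 and the digit set {-2, ..., 2}. This is not the paper's implementation: the function name `srt_radix4_divide` is hypothetical, exact rational arithmetic replaces hardware registers, and the quotient digit is chosen by rounding the full-precision ratio rather than by a QST or the fuzzy-logic method described later.

```python
from fractions import Fraction


def srt_radix4_divide(dividend, divisor, n_digits):
    """Run n_digits steps of P_{j+1} = 4*P_j - q_{j+1}*D (illustrative only)."""
    r, a = 4, 2                      # radix and largest digit magnitude
    rho = Fraction(a, r - 1)         # redundancy factor, 2/3 for this digit set
    d = Fraction(divisor)
    p = Fraction(dividend)
    # The recurrence stays bounded only if the first remainder is in range.
    assert d > 0 and abs(p) <= rho * d
    digits = []
    for _ in range(n_digits):
        shifted = r * p
        # Assumption: pick the nearest integer to shifted/D, clamped to the
        # digit set. Real hardware would inspect only a few leading bits.
        q = max(-a, min(a, round(shifted / d)))
        p = shifted - q * d          # next partial remainder
        digits.append(q)
    # Value of the redundant digit string, most significant digit first.
    quotient = sum(Fraction(q, r ** (j + 1)) for j, q in enumerate(digits))
    return digits, quotient, p


if __name__ == "__main__":
    digits, quotient, rem = srt_radix4_divide(Fraction(1, 3), Fraction(5, 8), 8)
    print(digits, float(quotient), rem)
```

After n steps the identity dividend = D * quotient + P_n / 4^n holds exactly, which is why a higher radix (more bits per digit) needs fewer iterations for the same precision.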
The biggest problem with the traditional approach to SRT division is the prediction of the quotient digit in the overlapping regions. These regions arise because the same region of the PD plot corresponds to more than one quotient digit, so within such a region the quotient digit can be either of two adjacent values. This implies that we have a choice of values, of both the partial remainder and the divisor, that will eventually separate these two adjacent regions. Peter et al. [9] suggested using steps in the overlapping regions, so that whenever an ambiguity arises in an overlapping region a step function determines the final outcome. This, however, is extremely complex, because the step function cannot be determined in a general way for all the overlapping regions. As can be seen from the graph, the number of steps increases with increasing radix, making the scheme more and more complex to execute. Further, the selection logic embedded in the quotient selection table becomes extremely long, and its implementation leads to extra hardware and high power consumption and may violate the critical-path timing constraint. This problem can be solved using fuzzy logic decision making.
Fig. 1. PD Plot for
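To make the overlap concrete, the following Python sketch computes the textbook selection interval of each quotient digit and the overlap between adjacent digits. It assumes the standard SRT containment condition |P_{j+1}| <= rho*D with redundancy factor rho = a/(r-1); the function names are hypothetical and the numbers are not taken from the paper's PD plot.

```python
from fractions import Fraction


def selection_interval(k, d, radix=4, a=2):
    """Range of the shifted remainder r*P_j for which digit k is valid."""
    rho = Fraction(a, radix - 1)     # 2/3 for radix 4 with digits -2..2
    return ((k - rho) * d, (k + rho) * d)


def overlap(k, d, radix=4, a=2):
    """Region where both digit k and digit k+1 are valid choices."""
    lower_of_next, _ = selection_interval(k + 1, d, radix, a)
    _, upper_of_this = selection_interval(k, d, radix, a)
    return (lower_of_next, upper_of_this)


if __name__ == "__main__":
    for k in range(-2, 2):
        print(k, k + 1, overlap(k, Fraction(1)))
```

For radix 4 every overlap has width (2*rho - 1)*D = D/3, so the ambiguity between adjacent digits is present for every divisor value and has to be resolved by some decision rule.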
III. PROPOSED ALGORITHM
The proposed algorithm relies on parallel execution and increased modularity, coupled with a smarter decision-making algorithm (fuzzy logic), in various steps so as to shorten the critical path and hence reduce the maximum delay.
Every logic circuit has a maximum delay on which its timing depends, and the path causing the maximum delay is known as the critical path. Our approach builds on previous SRT algorithm methods to make a highly parallel pipelined design that reduces the overall execution time of radix 4 SRT division on 64-bit double precision floating point numbers to 281 ns.
The basic concept used in [10] [11] is that the quotient selection algorithm is split into two reduced partial algorithms. The faster algorithm presented here follows a similar methodology but is highly modularized and parallel. The overlapping problem is solved by enlarging the overlapping area to obtain a uniform overlap within each set of quotient digits, as proposed and illustrated in Fig. 2.
Fig. 2. Uniform overlapped region PD Plot for
Instead of selecting the correct quotient digit q, -a ≤ q ≤ a, which gives rise to the graph shown in Fig. 1, we estimate a quotient digit q#, -a ≤ q# ≤ a-1, such that the actual quotient digit is either q# or q#+1 [12]. This makes the overlapping region uniform for each pair of consecutive quotient digits, making the overlapping decision easier. Fig. 3 shows the common uniform overlapping region. For an estimated quotient digit q#, the upper and lower limits for the corresponding partial remainder are:
Fig. 3. Common uniform overlapping region
In the overlapping region the decision problem is solved using fuzzy logic. We studied various fuzzy sets and methods applied in various domains [13] [14] [15]. After analysing the input fuzzy set, we found that there are several ways to determine the output from the inputs; the Mamdani, Larsen, Takagi-Sugeno-Kang and Tsukamoto inference and aggregation methods are the most widely used for such problems. We apply the Mamdani inference model to the overlapping region. Let us pick an input value that has non-zero membership in both the q# and q#+1 regions, P0 = 0.01; this causes both rules to fire. The value 0.01 has a membership of 0.75 in q#+1 and a membership of 0.25 in q#. Using the Mamdani model, these inputs are combined into an aggregate output. When all the possible combinations have been evaluated, the total output membership function (drawn in green in the corresponding plot) is obtained. This decision then determines the quotient correction.
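The Mamdani step for this example can be sketched in Python as follows. Only the firing strengths 0.25 (for q#) and 0.75 (for q#+1) come from the text; the output universe [0, 1], the two linear output membership functions, the centroid defuzzifier and the 0.5 threshold are all assumptions made for illustration, since the paper's membership plots are not reproduced here. The function name `mamdani_correction` is hypothetical.

```python
def mamdani_correction(w_keep, w_increment, samples=1001):
    """Mamdani min-max inference with centroid defuzzification (sketch).

    w_keep:      firing strength of the rule "digit is q#"
    w_increment: firing strength of the rule "digit is q# + 1"
    Returns (centroid, correction), where correction is 0 or 1.
    """
    num = 0.0
    den = 0.0
    for i in range(samples):
        y = i / (samples - 1)                 # point on the output universe
        # Assumed output sets: "keep" falls from 1 to 0, "increment" rises.
        mu_keep = min(w_keep, 1.0 - y)        # implication by clipping (min)
        mu_inc = min(w_increment, y)
        mu = max(mu_keep, mu_inc)             # aggregation by max
        num += y * mu
        den += mu
    if den == 0.0:
        raise ValueError("no rule fired")
    centroid = num / den
    # Assumed crisp decision: correct the digit when the centroid passes 0.5.
    return centroid, 1 if centroid > 0.5 else 0


if __name__ == "__main__":
    print(mamdani_correction(0.25, 0.75))
```

With the strengths from the example the centroid lies above 0.5 (about 0.61 under these assumed output sets), so the sketch selects a correction of 1, that is, the digit q#+1.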
In a nutshell, the interim quotient q# and the correction quotient are calculated in parallel with the new partial remainders. The latter are calculated for both the interim quotient and its incremented value, in functions outside the main body, partrem1 and partrem2 respectively.
Further, the on-the-fly conversion algorithm is implemented using a minimalistic approach in which the conversion table is reduced to four rows and two columns.
A. Interim Quotient
Here we take the interim quotient q# such that -a ≤ q# ≤ a-1, which implies the quotient digit can be either q# or q#+1. As a consequence, the range of the partial remainder is:
Since, , hence we get the following value of , Hence, as discussed in [8] [10] [11], the PD plot changes as shown below, increasing the overlapping area and hence making the overlapping regions uniform. The graph is divided into 4 regions by exploiting this uniformity. The partition is done using three horizontal lines as shown in Figure . The first one being X-axis and p g .
This makes the decision logic for the interim quotient lower in latency and faster than the traditional complex methods. However, the decision can be incorrect in the overlapping regions.
B. Correction Quotient
From Fig. 4 it follows that the correction quotient is a step function. Hence, from the graph in Fig. 4, the following equation can be derived:
The algorithm, as implemented in Bluespec System Verilog and loaded onto the board, is as follows:
C. Partial Remainder
Since the quotient digit can be either Q or Q+1 (which can only be determined after the quotient correction is calculated), we calculate the two possible remainders P0 and P1 in parallel. Computing both candidates in parallel and then selecting the appropriate one once the correction quotient is known shortens the critical path:
The parallel computation does not lengthen the critical path and hence increases the speed. If the quotient correction is 0 then P0 is used; otherwise P1 is used.
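The speculative remainder computation can be sketched in Python as below. The names P0 and P1 follow the text; the function name and its signature are hypothetical, and the sequential code only models what the hardware evaluates concurrently. The update rule is assumed to be the standard recurrence r*P - q*D.

```python
def next_partial_remainder(p, d, q_interim, correction, radix=4):
    """Compute both candidate remainders, then select one (sketch).

    p:          current partial remainder
    d:          divisor
    q_interim:  interim quotient digit (Q in the text)
    correction: 0 to keep Q, 1 to use Q + 1
    """
    shifted = radix * p
    # Both candidates depend only on q_interim, so in hardware they can be
    # evaluated side by side while the correction is still being decided.
    p0 = shifted - q_interim * d          # remainder if the digit is Q
    p1 = shifted - (q_interim + 1) * d    # remainder if the digit is Q + 1
    # A late select: only this choice waits for the correction quotient.
    return p1 if correction else p0


if __name__ == "__main__":
    print(next_partial_remainder(3, 5, 1, 0))
    print(next_partial_remainder(3, 5, 1, 1))
```

The design choice is to trade a second subtractor for a shorter dependency chain: the subtraction no longer has to wait for the correction logic to finish.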
D. On the Fly Algorithm
Unlike restoring or non-restoring division methods, SRT division produces its result as redundant digits, which must be converted into a non-redundant digit set. This can be done by subtracting the positionally weighted negative quotient digits from their positive counterparts. However, that requires a carry-propagate subtraction unit, which can be eliminated by using the on-the-fly algorithm, at the cost of some extra hardware [16]. In our implementation the traditional algorithm has been shortened to a reduced look-up table which runs in parallel with each iteration. As a result, the time for the parallel computation of the non-redundant digits is less than the time taken for each iteration, and hence the critical path isn't affected.
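For reference, the textbook form of on-the-fly conversion can be sketched in Python as follows. It keeps two running values, named q and qm here (assumed names), with qm always equal to q - 1, so that a negative digit is absorbed by extending qm instead of propagating a borrow. This is the general scheme, not the reduced four-row table used in this work, and integer arithmetic stands in for register concatenation.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (most significant first) to a plain integer."""
    q = 0        # value of the digits seen so far
    qm = -1      # always q - 1, kept so that no borrow ever propagates
    for d in digits:
        if d >= 0:
            new_q = radix * q + d               # append digit d to q
        else:
            new_q = radix * qm + (radix + d)    # append (radix - |d|) to qm
        if d > 0:
            new_qm = radix * q + (d - 1)        # append (d - 1) to q
        else:
            new_qm = radix * qm + (radix - 1 + d)
        q, qm = new_q, new_qm
    return q


if __name__ == "__main__":
    # 1*64 - 2*16 + 2*4 - 1 = 39
    print(on_the_fly_convert([1, -2, 2, -1]))
```

Every appended value lies between 0 and radix - 1, so each step is a pure shift-and-append on one of the two registers, which is what removes the carry-propagate subtraction from the end of the division.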
The implementation requires two registers, separating the digit vector into one vector for the positive and one for the negative quotient digits [17]. For radix 4 division specifically, two quotient bits are predicted in each cycle. The base 4 quotient digit obtained is therefore converted into a non-redundant representation by appending two binary digits to each of the two vectors.
The critical time came out to be 210 ns, which is significantly less than that of earlier proposed algorithms. Synthesis and simulation showed significantly improved results for the architecture in critical time period, speed, power and area consumed compared to other proposed algorithms. Further, the frequency obtained is approximately 1.5 GHz. The board resource utilization is given in Figure 6.
V. SCOPE AND FUTURE WORK
The algorithm can be further extended to higher radices for implementation on higher-end processors such as the S class, M class, H class and T class. For lower radices, i.e. radix 2, which is used in lower-end embedded systems, a trade-off must be made: taking the requirements of the system into consideration, radix 2 with parallelism and additional hardware will be implemented. Replacing the typical SRT algorithm with the parallel SRT variant in FPUs used in embedded systems will increase the speed considerably.
VI. CONCLUSION
This paper presents a novel approach that introduces parallelism in all paths whose latency is less than that of the critical path. This approach greatly reduces the critical time and hence the worst slack, giving a faster and more efficient FPU. Furthermore, the proposed algorithm, implemented in FPUs, can be used in arithmetic units for SoCs in smart and sophisticated embedded systems and processors, or as an intellectual property (IP) block in VLSI systems.
