Abstract-We investigate a single flux quantum (SFQ) floating-point divider based on Goldschmidt's algorithm that is suitable for implementation with bit-serial pipelined SFQ circuits. We designed a bit-serial SFQ divider that outputs an 11-bit quotient. The simulation results show correct operation at a frequency of 50 GHz with a bias margin of 80%-125%. We also estimated the dependence of the latency and the number of Josephson junctions on the accuracy of the outputs. According to the estimation, a double-precision SFQ divider can be designed using 27 000 Josephson junctions, which can be implemented on one chip using current SFQ circuit fabrication technology. Because of its small circuit area and multiplier-based hardware architecture, the investigated divider can be applied to build an SFQ graphics processing unit.
I. INTRODUCTION
A single flux quantum (SFQ) logic circuit has the potential for high-speed and low-power operation compared to the semiconductor CMOS circuit [1]. The SFQ logic circuit is expected to be applied to next-generation high-performance computing (HPC) systems because of its high energy efficiency [2]. In the field of HPC, general-purpose computing on graphics processing units (GPGPU) is becoming mainstream because of its high performance in terms of calculation power and energy efficiency [3]. In graphics processing units (GPUs), floating-point arithmetic plays a more important role than in general-purpose processors. Among the four arithmetic operations, division is the most complex and requires the longest calculation time. It is reported that floating-point division accounts for approximately 40% of the overall calculation time, although division makes up only 3% of the total instructions executed in the SPECfp92 benchmark [4].
We have been studying a GPGPU-based SFQ processor to implement high-speed and energy-efficient future computing systems. To implement the GPGPU-based processor, it is important to implement a floating-point arithmetic unit that can efficiently perform the four arithmetic operations: addition, subtraction, multiplication, and division. So far, addition and multiplication based on the SFQ logic have been studied using various hardware architectures, such as a bit-serial architecture [5]-[8], a bit-slice architecture [9], [10], and a bit-parallel architecture [11], [12]. An SFQ divider that employs a systolic-array architecture has also been studied [13]. Although this divider has good scalability, a large circuit area is required to scale up the bit length because the number of processing unit cells is proportional to the number of bits. In this study, we investigate a hardware algorithm for the SFQ divider that is suitable for implementing an SFQ-based GPU, based on the bit-serial approach, which can be implemented with a small circuit area.
The authors are with Department of Electrical and Computer Engineering, Yokohama National University, Yokohama, Kanagawa 240-8501, Japan (e-mail:, sanada-akiyoshi-sk@ynu.jp).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASC.2019.2902800
II. GOLDSCHMIDT'S ALGORITHM
There are two major hardware approaches to implementing a divider: the subtractive and multiplicative approaches [14]. The subtractive approach, which is similar to manual long division, uses subtraction and shift operations to calculate the quotient. The calculation time of the subtractive method is relatively long because the subtraction and shift sequence has to be repeated many times, at least as many times as the bit length of the dividend. Owing to the simplicity of its calculation and implementation, the subtractive approach has been widely used in commercial microprocessors [15]. The multiplicative approach uses multiplication to calculate the quotient on the basis of a mathematical algorithm. In this approach, the input dividend converges to the quotient quadratically through iterated multiplication. The multiplicative approach has also been implemented in various microprocessors such as the IBM RS/6000 and the AMD K7 processor [16]. Because the multiplier in a divider based on the multiplicative approach can also be used as the floating-point multiplier, a GPU, which has to perform both multiplication and division, can be implemented efficiently. We therefore employed the multiplicative approach to implement the SFQ floating-point divider. The Newton-Raphson algorithm [14] and Goldschmidt's algorithm [17] are well-known multiplicative algorithms. We employ Goldschmidt's algorithm for the SFQ divider because it is suitable for the pipelined SFQ logic circuit, as we discuss later.
We investigated the division of floating-point numbers defined by the IEEE standard 754 [18]. A floating-point number $X$ is represented as $X = (-1)^{X_S} \times 2^{X_E} \times X_F$, where $X_S$, $X_E$, and $X_F$ are the sign, exponent, and significand, respectively. The numbers of bits of the exponent and the significand are denoted by $n_e$ and $n_f$, respectively.
According to IEEE 754, the standard formats with $(n_e, n_f) = (5, 11)$, $(8, 24)$, $(11, 53)$, and $(15, 113)$ are called half-, single-, double-, and quadruple-precision floating-point numbers, respectively.
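As an illustration of this representation, the following minimal Python sketch splits a double-precision number, $(n_e, n_f) = (11, 53)$, into its sign, stored exponent, and significand. The function name `decode_double` is hypothetical, and the sketch assumes a normal (not subnormal, infinite, or NaN) input, so that the significand carries an implicit leading 1.

```python
import struct


def decode_double(x: float):
    """Split an IEEE 754 double into (sign, stored exponent, significand).

    Assumes x is a normal number, so the significand lies in [1, 2).
    """
    # Reinterpret the 64-bit pattern of the double as an unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 sign bit
    stored_exp = (bits >> 52) & 0x7FF    # n_e = 11 exponent bits (biased by 1023)
    fraction = bits & ((1 << 52) - 1)    # 52 explicitly stored fraction bits
    # n_f = 53 counts the implicit leading 1 together with the 52 stored bits.
    significand = 1.0 + fraction / 2**52
    return sign, stored_exp, significand


if __name__ == "__main__":
    # -6.5 = (-1)^1 * 2^2 * 1.625, and the stored exponent is 2 + 1023.
    print(decode_double(-6.5))
```

The stored exponent returned here includes the bias, which is the convention the exponent calculation of the divider relies on.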
1051-8223 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
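Let the dividend, the divisor, and the quotient be $Z$, $D$, and $Q$, respectively. When we represent $Z$, $D$, and $Q$ in the floating-point format as
$$Z = (-1)^{z_S} \times 2^{z_E} \times z_F, \quad D = (-1)^{d_S} \times 2^{d_E} \times d_F, \quad Q = (-1)^{q_S} \times 2^{q_E} \times q_F, \tag{1}$$
the sign, exponent, and significand of $Q$, namely $q_S$, $q_E$, and $q_F$, can be represented as
$$q_S = z_S \oplus d_S, \tag{2}$$
$$q_E = z_E - d_E + n_{bias}, \tag{3}$$
$$q_F = z_F / d_F, \tag{4}$$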
where $n_{bias}$ is the bias for the exponent. In the floating-point format, the stored exponent is offset from the actual exponent by $n_{bias}$ to simplify the comparison of two exponents [18]. To calculate $q_F$, we employ Goldschmidt's division algorithm, which is based on the property of a fraction that its value does not change when the numerator and the denominator are multiplied by the same value. In Goldschmidt's division algorithm, the quotient is represented as
$$q_F = \frac{z_F}{d_F} = \frac{z_F \, x_0 x_1 \cdots x_{k-1}}{d_F \, x_0 x_1 \cdots x_{k-1}},$$
where $k$ is the number of iterations of multiplication.
Let us define $d_i$ and $z_i$ as
$$d_{i+1} = d_i x_i$$
and
$$z_{i+1} = z_i x_i,$$
where $d_0$ and $z_0$ are the denominator $d_F$ and the numerator $z_F$, respectively. By choosing $x_i$ to satisfy
$$x_i = 2 - d_i,$$
the denominator $d_i$ converges to 1, and therefore the numerator $z_i$ converges to the quotient $q_F$ [3]. To shorten the convergence time of $q_F$, the initial value $x_0$ is set to an approximation of $1/d_F$ [14]. Fig. 1 shows the flow of Goldschmidt's division algorithm. Fig. 2 shows a flow chart of the division based on Goldschmidt's algorithm. One can see that the two calculations that update $d_i$ and $z_i$ can be performed by repeating simple multiplications. Moreover, these two multiplications can be performed independently. Because no data hazard occurs between the two multiplication branches, the multiplications can be performed efficiently by introducing pipelining. Therefore, Goldschmidt's division algorithm, which iterates simple multiplications, is suitable for the SFQ circuit, whose logic gates have a latching function and which implements pipelining without pipeline registers. Let us assume that $x_0$ approximates $1/d_0$ with an error $\epsilon_0$ on the order of $2^{-p}$, where $\epsilon_0$ is the error between $x_0$ and $1/d_0$, and $p$ is $\epsilon_0$ expressed in bits. Since Goldschmidt's division algorithm converges to the accurate quotient quadratically, when $p = 8$, the bit accuracy of $q_F$ improves to 7, 15, 31, 63, and 127 bits with successive iterations.
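The iteration described in this section can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit: the function name `goldschmidt_significand` is hypothetical, exact rational arithmetic stands in for the finite-width bit-serial multiplier, and the lookup table is assumed to return $1/d_F$ truncated to $p$ fractional bits.

```python
from fractions import Fraction
import math


def goldschmidt_significand(z_f, d_f, p, iterations):
    """Return the successive approximations z_1, ..., z_k of z_f / d_f."""
    z, d = Fraction(z_f), Fraction(d_f)
    # Hypothetical lookup table: 1/d_f truncated to p fractional bits.
    x = Fraction(math.floor((1 / d) * 2**p), 2**p)
    approximations = []
    for _ in range(iterations):
        # The two products do not depend on each other, so they can be pipelined.
        z, d = z * x, d * x
        approximations.append(z)
        # Next factor x_i = 2 - d_i (a two's complement of d_i in hardware).
        x = 2 - d
    return approximations


if __name__ == "__main__":
    z_f, d_f = Fraction(3, 2), Fraction(5, 4)
    exact = z_f / d_f
    steps = goldschmidt_significand(z_f, d_f, p=8, iterations=3)
    for i, z_i in enumerate(steps, start=1):
        rel_err = 1 - z_i / exact
        print(f"iteration {i}: relative error = 2^{math.log2(rel_err):.0f}")
```

For this particular input the relative error is exactly $2^{-8}$, $2^{-16}$, and $2^{-32}$ after the first, second, and third iterations, which shows the quadratic convergence. Other inputs give a somewhat larger initial error after truncation, so the number of accurate bits per iteration can be slightly lower.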
III. CIRCUIT DESIGN AND PERFORMANCE EVALUATION
From equations (2), (3), and (4), floating-point division can be calculated by using an exclusive-OR gate for the sign, an adder/subtractor for the exponent, and a divider for the significand. We designed a division circuit for the significand that calculates $z_F/d_F$ based on Goldschmidt's division algorithm, as shown in Fig. 2. The circuit components of the division circuit for the significand are the lookup table for inputting $x_0$, the multiplier, the 2's complement converter (2's comp conv.) for $x_i$, and registers (Reg) that store the multiplication results of each iteration stage. We employ the bit-serial multiplier proposed in [8], shown in Fig. 3, which has low latency and good scalability. Fig. 4 shows the block diagram of the proposed bit-serial Goldschmidt's SFQ divider composed of the circuit elements mentioned above. To evaluate the performance and the circuit scale of the SFQ divider, we designed the bit-serial divider using the cell library we developed [19] for the AIST 10 kA/cm$^2$ Nb advanced process 2 [20]. To reduce the propagation delay of the data, passive transmission lines with a characteristic impedance of 3.5 Ω are used for wiring between the circuit components [21], [22]. Because implementing a large-scale SFQ lookup table using the current superconducting circuit fabrication process is difficult, the use of an SFQ/CMOS hybrid lookup table [23] is assumed in the circuit design. Fig. 5 shows the layout of the designed 4-bit SFQ divider composed of 8,091 Josephson junctions (JJs), assuming that the bit lengths of $d$ and $z$ are 4 and the accuracy of $x_0$ is $2^{-3}$ ($p = 3$). With this bit length and accuracy, the accuracy of the quotient $q_F$ reaches $2^{-11}$ after three iterations of multiplication. Since an accuracy of $2^{-11}$ is acceptable for practical applications, we evaluated the latency of the divider assuming three iterations. Fig. 6 shows the time scheduling of each calculation of the 4-bit input Goldschmidt's SFQ divider.
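The decomposition stated at the beginning of this section (exclusive-OR for the sign, adder/subtractor for the exponent, divider for the significand) can be sketched as follows. The names `fp_divide` and `to_value` are hypothetical, the half-precision bias of 15 is used as the default, and an ordinary division stands in for the Goldschmidt significand divider. The final normalization step is an assumption added so that the result stays in $[1, 2)$; it is not described in the text.

```python
def fp_divide(z, d, n_bias=15):
    """Divide two numbers given as (sign, stored exponent, significand).

    Significands are assumed to lie in [1, 2); exponents are stored with a bias.
    """
    z_s, z_e, z_f = z
    d_s, d_e, d_f = d
    q_s = z_s ^ d_s             # sign: exclusive-OR
    q_e = z_e - d_e + n_bias    # exponent: subtract, then restore the bias
    q_f = z_f / d_f             # significand: stand-in for the Goldschmidt divider
    if q_f < 1.0:
        # Assumed normalization step: shift the significand back into [1, 2).
        q_f *= 2.0
        q_e -= 1
    return q_s, q_e, q_f


def to_value(s, e, f, n_bias=15):
    """Convert (sign, stored exponent, significand) back to a Python float."""
    return (-1) ** s * 2.0 ** (e - n_bias) * f


if __name__ == "__main__":
    six = (0, 17, 1.5)          # +1.5 * 2^(17 - 15) = 6
    minus_three = (1, 16, 1.5)  # -1.5 * 2^(16 - 15) = -3
    q = fp_divide(six, minus_three)
    print(q, to_value(*q))
```

Because the two significands are both in $[1, 2)$, their ratio lies in $(0.5, 2)$, so at most one normalization shift is needed in this model.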
We simulated the designed circuit with the Cadence Verilog-XL simulator using a behavioral model that defines the setup time, hold time, and latency of each cell [24]. Fig. 7 shows the digital simulation results of the designed half-precision SFQ divider. The simulation results show that the designed divider operates correctly up to a clock frequency of 76.9 GHz. The latency of the 4-bit divider is 87 clocks, as shown in Fig. 6. The dc bias margin of the 4-bit divider is 80%-125%. The clock frequency dependence of the dc bias margin is shown in Fig. 8.
Based on the design result of the divider circuit, we estimated the latency and the number of Josephson junctions (JJs) of the SFQ divider, assuming $p = 8$, as a function of the bit length of the input data, which is the same as the accuracy of the quotient. Fig. 9 shows the estimated latency and the number of JJs of the bit-serial SFQ divider as a function of the accuracy of the quotient. The orders of the latency and the number of JJs are $O(n \log n)$ and $O(n)$, respectively, where $n$ is the bit length of the input data. From this estimation, when the circuit operates at a frequency of 50 GHz, single- and double-precision division can be performed with latencies of 5.3 ns and 15.1 ns, respectively. To reduce the latency of the divider further, using two multipliers is effective. As discussed in Section II, the two multiplication branches can be performed independently. If we adopt this expanded architecture, the latency of the divider can be reduced by approximately 40% by using two multipliers, although the circuit area of the divider increases. The latency cannot be reduced by 50% because the multiplication cannot be parallelized at the last iteration.
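The latency figures above can be related to clock counts, and the reason the two-multiplier reduction stays below 50% can be illustrated with a simple counting model. The sketch below is an idealized assumption, not the authors' timing analysis: it counts sequential multiplier passes only, assumes that the last iteration needs only the numerator update, and ignores the lookup table, the 2's complement converter, and wiring delays. All function names are hypothetical.

```python
def latency_ns(clocks: float, clock_ghz: float) -> float:
    """Latency in nanoseconds for a given number of clock cycles."""
    return clocks / clock_ghz


def clocks(latency_in_ns: float, clock_ghz: float) -> float:
    """Number of clock cycles corresponding to a latency in nanoseconds."""
    return latency_in_ns * clock_ghz


def multiplier_passes(k: int, multipliers: int) -> int:
    """Sequential multiplier passes for k iterations (idealized model)."""
    if multipliers == 1:
        # Numerator and denominator updates share one multiplier;
        # the denominator update of the last iteration is not needed.
        return 2 * k - 1
    # With two multipliers, both updates of an iteration run side by side.
    return k


def latency_reduction(k: int) -> float:
    """Fractional reduction in passes when going from one to two multipliers."""
    return 1 - multiplier_passes(k, 2) / multiplier_passes(k, 1)


if __name__ == "__main__":
    print(latency_ns(87, 76.9))   # 4-bit divider at its maximum clock frequency
    print(clocks(15.1, 50.0))     # double-precision estimate expressed in clocks
    print(latency_reduction(3), latency_reduction(4))
```

In this model, three iterations give a reduction of 40% and four iterations give about 43%, and the value approaches but never reaches 50% as the number of iterations grows.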
We also estimated the number of JJs as a function of the accuracy of the quotient. Fig. 9(b) shows the dependence of the number of JJs of the divider on the bit length of the input data. The order of the number of JJs is $O(n)$ because the number of JJs of the multiplier is proportional to the input bit length [8]. According to our estimation, the numbers of JJs required to implement single- and double-precision dividers are 13,594 and 26,518, respectively. These JJ counts can be implemented on one chip using the current circuit fabrication process [25]. If we employ the parallel multiplication mentioned above, the number of JJs of the double-precision divider is estimated to be 41,284.
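Since the JJ count is stated to scale as $O(n)$, the two quoted estimates can be joined by a straight line to get a feel for the cost per bit. The sketch below is only a rough reading of the quoted numbers: it assumes that $n$ is the significand length $n_f$ (24 and 53 bits), and the names `JJ_PER_BIT`, `estimate_jj`, and `SECOND_MULTIPLIER_COST` are hypothetical. Values outside the 24-to-53-bit range would be extrapolations and should not be read as results of the paper.

```python
# Estimates quoted in the text: (significand bits, number of JJs).
SINGLE = (24, 13_594)
DOUBLE = (53, 26_518)
DOUBLE_TWO_MULTIPLIERS = 41_284

# Slope of the straight line through the two quoted points.
JJ_PER_BIT = (DOUBLE[1] - SINGLE[1]) / (DOUBLE[0] - SINGLE[0])

# Extra junctions quoted for the two-multiplier double-precision variant.
SECOND_MULTIPLIER_COST = DOUBLE_TWO_MULTIPLIERS - DOUBLE[1]


def estimate_jj(n_bits: int) -> float:
    """Linear interpolation of the JJ count between the two quoted points."""
    return SINGLE[1] + JJ_PER_BIT * (n_bits - SINGLE[0])


if __name__ == "__main__":
    print(f"about {JJ_PER_BIT:.0f} JJs per additional bit")
    print(f"two-multiplier overhead: {SECOND_MULTIPLIER_COST} JJs")
    print(f"interpolated count at 32 bits: {estimate_jj(32):.0f}")
```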
IV. CONCLUSION
We investigated a floating-point bit-serial divider for SFQ logic based on Goldschmidt's division algorithm. We designed a 4-bit SFQ divider composed of one pipelined multiplier using the AIST 10 kA/cm$^2$ Nb advanced process. When the bit lengths of the dividend and the divisor are 4, an accuracy of $2^{-11}$ can be obtained by iterating the multiplication three times. The estimated latency and the number of JJs of the designed 4-bit SFQ divider are 87 clocks and 8,091, respectively. According to the estimation based on the circuit design of the 4-bit divider, a double-precision divider following this architecture can be implemented with 26,518 JJs, which fits on one chip using the current SFQ circuit fabrication technology.
