A rapid single-flux-quantum (RSFQ) truncated multiplier based on bit-level processing is proposed. In the multiplier, the two operands are transformed into two serialized patterns of bits (pulses), and the multiplication is carried out by processing those bits. The result is obtained by counting bits. Because the calculation is done at the bit level, the proposed multiplier can be implemented in a small area. The gate-level design of the multiplier is shown. The layout of a 4-bit multiplier was also designed.
key words: rapid single flux quantum circuit, truncated multiplier, pulse logic
Introduction
Superconducting computing devices have been considered as potential alternatives to mainstream semiconductor computing devices [1]. The superconducting rapid single-flux-quantum (RSFQ) circuit technology [2] is a promising digital circuit technology for high-speed and low-power operation.
In RSFQ logic circuit design, bit-serial or bit-slice processing has been used for arithmetic circuits rather than parallel processing, which consumes larger circuit area. For example, a bit-serial adder and a bit-serial multiplier have been proposed [3], [4]. Designing layouts of large RSFQ logic circuits is hard because the timing design of large circuits is difficult. Simple and compact designs of arithmetic circuits are therefore desired, especially for multipliers, which consume large area.
In this brief, we propose an RSFQ truncated multiplier based on bit-level processing. Generally, truncated multipliers, which discard the lower part of the partial product bits of a complete multiplier to save circuit area, are realized as parallel processing circuits. The proposed truncated multiplier instead processes at the bit level to achieve a small circuit area. In the multiplier, each operand is transformed into a serialized pattern of bits (pulses). The multiplication is done on those bits on two lines, and the multiplication result is obtained by counting bits of the same weight for $2^n - 1$ cycles in $n$-bit multiplication.
In bit-serial processing, each bit of the $n$-bit operands fed serially has its own weight. In the proposed multiplier, on the other hand, $n$-bit operands fed in parallel are converted into $(2^n - 1)$-bit bit-patterns, and every bit in the patterns has the same weight. The proposed multiplier takes longer to process than bit-serial multipliers; however, it is simple and can be realized in a compact area.
The proposed truncated multiplier is suitable for applications that tolerate small errors. Recently, hardware accelerators for neural network processing such as [5] have attracted attention. It is well known that inference using neural networks can be carried out in low precision. An 8-bit data type has been used in the accelerator in [5], and support for a 4-bit data type has been added in NVIDIA Turing architecture GPUs [6] and AMD Vega architecture GPUs. The proposed multiplier will be suitable for such applications.
We designed a 4-bit layout of the proposed multiplier with the cell library developed for AIST advanced process (ADP2) [7] . Its functionality was evaluated by logic simulation. Its maximum absolute error in its multiplication result was also evaluated.
Preliminaries

Truncated Multiplication
We consider $n$-bit unsigned truncated multiplication of a multiplicand $X = [0.x_1 \cdots x_n]_2$ and a multiplier $Y = [0.y_1 \cdots y_n]_2$. We let the resultant product be $Z = [0.z_1 \cdots z_n]_2$. $X$, $Y$, and $Z$ are fixed-point numbers ($0 = [0.0 \cdots 0]_2 \le X, Y, Z \le 1 - 2^{-n} = [0.1 \cdots 1]_2$), and each of $x_i$, $y_i$, and $z_i$ is either 0 or 1.
In truncated multiplication, the lower part of the partial product bits is discarded. We show the partial product bits in Fig. 1. The upper bits enclosed by the dashed lines, whose weights are larger than $2^{-n-1}$, are summed up. The middle bits with weight $2^{-n-1}$ are used for compensating the result, and the remaining lower bits are discarded. The result of the truncated multiplication is represented as follows:
$$Z = \sum_{i+j \le n} x_j y_i 2^{-(i+j)} + f(x_n y_1, x_{n-1} y_2, \ldots, x_1 y_n) \cdot 2^{-n} \tag{1}$$
where $f$ is an error-compensation function.
In this brief, $f(b_1, \ldots, b_n) = b_1 + \cdots + b_n$ is considered as the compensation function. Compensation with this function is equivalent to rounding each partial product $P_i = X \cdot y_i \cdot 2^{-i}$ to its nearest value.
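As a minimal sketch of this definition, the following Python function evaluates the truncated product in units of $2^{-n}$ (ulp). It assumes that Formula (1) sums the upper bits with their weights and adds the compensation term $f$ scaled by $2^{-n}$. The names `to_bits` and `truncated_product_ulp` are illustrative and do not appear in the original design.

```python
def to_bits(value, n):
    """Return (b_1, ..., b_n) of an n-bit integer, most significant bit first."""
    return [(value >> (n - k)) & 1 for k in range(1, n + 1)]


def truncated_product_ulp(x_bits, y_bits):
    """Truncated product with f = b_1 + ... + b_n, scaled by 2**n (result in ulp).

    Assumption: upper bits (i + j <= n) keep their weights, middle bits
    (i + j == n + 1) each add one ulp, and all lower bits are discarded.
    """
    n = len(x_bits)
    total = 0
    for j in range(1, n + 1):          # index of x_j
        for i in range(1, n + 1):      # index of y_i
            bit = x_bits[j - 1] & y_bits[i - 1]
            if i + j <= n:
                total += bit << (n - (i + j))   # upper bits, weight 2**-(i+j)
            elif i + j == n + 1:
                total += bit                    # middle bits: compensation term
            # bits with i + j > n + 1 are discarded
    return total


n = 4
x, y = 0b1011, 0b1101                  # X = 11/16, Y = 13/16
z = truncated_product_ulp(to_bits(x, n), to_bits(y, n))
print(z, x * y / 2**n)                 # truncated result and exact product, both in ulp
```

For these operands the truncated result is 10 ulp, whereas the exact product is 8.9375 ulp.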
RSFQ Circuits
In RSFQ circuits, voltage pulses are used to represent logic values. Each basic logic gate, such as AND, OR, and XOR, has a clock input terminal as shown in Fig. 2 (a) and works in synchronization with clock pulses. When a pulse arrives at a data input of a gate during an interval between adjacent clock pulses, the input value corresponding to the interval is "1" as shown in Fig. 2 (b). If no pulse arrives during the interval, the input value is "0". Feeding more than one pulse to a data input of a basic logic gate during an interval is prohibited. The output of a gate is synchronized with the clock pulse.
In addition to basic logic gates, several special gates exist as shown in Fig. 3. In the figure, the symbol and the pulse-transferring finite state machine (PTFSM) [8] of each gate are shown. The non-destructive read-out (NDRO) gate has two internal states, i.e., ST0 and ST1, as shown in Fig. 3 (a). It outputs a pulse at dout only when its internal state is ST1 and a pulse arrives at its clk terminal. The T1 gate in Fig. 3 (b) works like a counter of pulses. When the internal state of a T1 gate is ST1, it outputs a pulse at its carry or sum terminal once a pulse arrives at its din or clk terminal, respectively. The confluence buffer (CB) in Fig. 3 (c) merges pulses on its two inputs into its output.
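The behavior of these special gates can be sketched as small state machines in Python. This is only a behavioral model under stated assumptions. The text above specifies when a pulse is output, whereas the state transitions of the T1 gate are assumed here: din toggles the state, and a clk read-out returns the gate to ST0. The class and function names are illustrative.

```python
class NDRO:
    """Non-destructive read-out gate with internal states ST0 (0) and ST1 (1)."""

    def __init__(self):
        self.state = 0

    def set(self):
        self.state = 1

    def rst(self):
        self.state = 0

    def clk(self):
        # A pulse appears at dout only in ST1; the state is not changed.
        return self.state


class T1:
    """T1 gate modeled as a one-bit pulse counter."""

    def __init__(self):
        self.state = 0

    def din(self):
        carry = self.state   # pulse at carry only when din arrives in ST1
        self.state ^= 1      # ASSUMPTION: every din pulse toggles the state
        return carry

    def clk(self):
        total = self.state   # pulse at sum only when clk arrives in ST1
        self.state = 0       # ASSUMPTION: read-out returns the gate to ST0
        return total


def confluence_buffer(a, b):
    """Merge two pulse lines; assumes the pulses never arrive simultaneously."""
    return a | b


gate = NDRO()
gate.set()
print(gate.clk(), gate.clk())   # reading twice does not destroy the state
```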
RSFQ Truncated Multiplier Based on Bit-Level Processing
Structure
We propose an RSFQ truncated multiplier based on bit-level processing. We show its structure in Fig. 4 . It consists of a pattern generator, two bit generators, an AND gate, and a pulse counter.
The pattern generator has an $n$-bit pattern output $S = (s_1, s_2, \ldots, s_n)$. Over a period of $2^n - 1$ (or $2^n$) clock cycles, it outputs each of the $2^n - 1$ patterns other than the all-0 pattern (or each of the $2^n$ patterns including the all-0 pattern) to $S$ exactly once, one pattern per clock cycle. We let the number of clock cycles in a period be $T$; $T$ is $2^n - 1$ or $2^n$.
The structure of the bit generator is shown in Fig. 5. Its inputs are an $n$-bit pattern $R = (r_1, r_2, \ldots, r_n)$, an $n$-bit operand $Q = (q_1, q_2, \ldots, q_n)$, a one-bit clock input clk, and a one-bit reset input rst. Its output is a one-bit signal $b$. An operand of the multiplier is fed to $Q$, and $b$ is calculated according to the operand value and the input pattern $R$ fed from the pattern generator. The output $S$ of the pattern generator is connected to the two bit generators differently, as shown in Fig. 4: $s_i$ is connected to $r_i$ of the bit generator for $Y$, and to $r_{n+1-i}$ of the bit generator for $X$. The input operands $X$ and $Y$ are treated as $n$-bit patterns $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$, respectively. $x_i$ is connected to $q_i$ of the bit generator for $X$, and $y_i$ is connected to $q_i$ of the bit generator for $Y$. Once operand values of the multiplier are fed to the bit generators, those values are latched in the generators. The rst input of the generator is used for resetting the latches at the beginning of a new calculation.
Each bit generator converts an operand into a serialized pattern of bits (pulses). The AND gate computes the logical AND of the bits on the two lines. The weight of each bit (pulse) from the gate is 1 ulp. The pulse counter counts the pulses from the gate for $T$ cycles, and outputs the counted result every $T$ cycles. The counted result is the multiplication result.
Calculation
We now describe the calculation performed by the multiplier. First, the function of the bit generator is shown. Then, the combined calculation of the two generators and the AND gate is shown.
In the bit generator of Fig. 5, the NDRO gates in the first row are used to generate part of the $w_h$ signals ($1 < h \le n$). For those NDRO gates, the orders of pulse arrivals are represented by inequalities [9]. Pulses distributed from the clk terminal of the generator are first fed into the set terminals of those NDROs to set their internal states. Then, the NDRO for $w_h$ receives $(r_1 + \cdots + r_{h-1})$ pulses at its rst terminal through CBs. Finally, the NDRO for $w_h$ may receive a pulse fed to $r_h$ at its clk terminal, and it outputs a pulse according to its internal state. The value of signal $w_h$ is described as follows:
$$w_h = \overline{r_1} \wedge \cdots \wedge \overline{r_{h-1}} \wedge r_h$$
Note that, among all $w_h$ signals, at most one signal takes "1", and the others are "0". In a period of $T$ cycles, $w_h$ takes "1" in $2^{n-h}$ cycles ($1 \le h \le n$).
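The two properties just noted can be checked with a short Python sketch. It assumes the reading of the text in which $w_h$ is "1" exactly when $r_h$ is the first "1" in $R$, since any earlier "1" resets the corresponding NDRO before it is read. The function name `w_signals` is illustrative.

```python
from itertools import product


def w_signals(r):
    """Return (w_1, ..., w_n): w_h is 1 iff r_h is the first '1' in r."""
    w = [0] * len(r)
    for h, bit in enumerate(r):
        if bit:
            w[h] = 1      # earlier bits are all 0, so this signal fires
            break         # later signals are blocked by this '1'
    return w


n = 4
counts = [0] * n
for r in product((0, 1), repeat=n):
    if not any(r):
        continue                      # the all-0 pattern produces no pulse
    w = w_signals(r)
    assert sum(w) <= 1                # at most one w_h is '1' per cycle
    for h in range(n):
        counts[h] += w[h]

print(counts)   # number of cycles per period in which each w_h is '1'
```

For $n = 4$ the counts are 8, 4, 2, and 1, which matches $2^{n-h}$ for $h = 1, \ldots, 4$.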
The internal states of the NDROs in the second row of the bit generator are set by operand $Q$ beforehand. Each NDRO in the second row outputs a pulse (bit) according to its internal state when it receives a pulse at its clk terminal. Among those NDROs, at most one NDRO outputs a pulse at each clock cycle. To save circuit area, CBs are used instead of OR gates for calculating the logical OR of the outputs. The output of the generator is represented as follows:
$$b = \bigvee_{h=1}^{n} (q_h \wedge w_h)$$
$b$ takes "1" in $\sum_{h \le n} q_h \cdot 2^{n-h}$ cycles in a period of $T$ cycles. For example, if $q_1 = \cdots = q_n = 1$, $b$ takes "1" in $2^n - 1$ ($= q_1 \cdot 2^{n-1} + \cdots + q_n \cdot 2^0$) cycles in a period of $T$ cycles. The bit generators thus convert $X$ and $Y$ into bit-patterns in which the number of "1"s corresponds to the magnitude of the operand. Because the bit generator for $X$ and the bit generator for $Y$ are connected to the pattern generator differently, the outputs of the two bit generators are computed differently. The AND gate in Fig. 4 receives the outputs of the two bit generators. The output of the AND gate $p$ ($= b_y \wedge b_x$) is represented as follows:
$$p = \bigvee_{i+j \le n+1} x_j \, y_i \left(\overline{s_1} \cdots \overline{s_{i-1}} \, s_i\right) \left(s_{n+1-j} \, \overline{s_{n+2-j}} \cdots \overline{s_n}\right)$$
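The above formula indicates that if $x_j = y_i = 1$ and $i + j \le n + 1$, the AND gate outputs "1" when the output of the pattern generator $S$ is as follows:
$$S_{j,i} = (s_1, \ldots, s_n) = (0, \ldots, 0, \overset{s_i}{1}, *, \ldots, *, \overset{s_{n+1-j}}{1}, 0, \ldots, 0)$$
where $*$ denotes "don't care". When $i + j = n + 1$, $s_i$ and $s_{n+1-j}$ are the same bit and there is no don't care. For distinct pairs of $j$ and $i$, the patterns $S_{j,i}$ do not overlap each other.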
The pattern generator feeds all $n$-bit patterns other than the all-0 pattern. If $x_j = y_i = 1$ and $i + j \le n$, the AND gate outputs at least $2^{n-(i+j)}$ pulses in a period of $T$ cycles because there are $n - (i + j)$ don't-care bits in $S_{j,i}$. Thus, the number of pulses the AND gate outputs in a period of $T$ cycles is described as follows:
$$\sum_{i+j \le n} x_j y_i 2^{n-(i+j)} + \sum_{i+j = n+1} x_j y_i$$
because no two distinct $S_{j,i}$ overlap. The first term corresponds to the first term of Formula (1), and the second term corresponds to the compensation function. Therefore, the counted result of the pulse counter corresponds to the result of the truncated multiplication.
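The argument above can be checked exhaustively for small $n$ with a behavioral sketch in Python. The model is an assumption-laden simplification: the pattern generator is replaced by a plain enumeration of all nonzero patterns (the order does not affect the count), each bit generator selects the operand bit $q_h$ for which $r_h$ is the first "1" in $R$, and all function names are illustrative.

```python
from itertools import product


def bit_generator(r, q):
    """Output b: the operand bit q_h selected by the first '1' in r (0 if none)."""
    for h, bit in enumerate(r):
        if bit:
            return q[h]
    return 0


def simulate_count(x, y):
    """Count AND-gate pulses over one period of all nonzero patterns S."""
    n = len(x)
    count = 0
    for s in product((0, 1), repeat=n):
        if not any(s):
            continue                      # all-0 pattern yields no pulse
        b_y = bit_generator(s, y)         # s_i connected to r_i
        b_x = bit_generator(s[::-1], x)   # s_i connected to r_{n+1-i}
        count += b_y & b_x
    return count


def formula_count(x, y):
    """Pulse count predicted by the closed-form sum over partial product bits."""
    n = len(x)
    total = 0
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            bit = x[j - 1] & y[i - 1]
            if i + j <= n:
                total += bit << (n - (i + j))
            elif i + j == n + 1:
                total += bit
    return total


n = 4
operands = list(product((0, 1), repeat=n))
mismatches = [(x, y) for x in operands for y in operands
              if simulate_count(x, y) != formula_count(x, y)]
print(len(mismatches))   # number of operand pairs where model and formula disagree
```

Under these assumptions the simulated pulse count agrees with the closed-form expression for all 256 operand pairs of the 4-bit case.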
To show the operation of the truncated multiplier visually, we consider a 4-bit design as an example. The Karnaugh maps of the bit generators for $b_x$ and $b_y$ with respect to the output $S$ of the pattern generator are shown in Figs. 6 (a) and (b), respectively. In the maps, the operand value is represented by the number of ones. The $2^{4-i}$ cells corresponding to $x_i$ or $y_i$ are enclosed by broken lines.
Detailed Design
For the pattern generator, a linear feedback shift register (LFSR) could be used. For a given $n$, an LFSR whose period $T$ is $2^n - 1$ can be designed.
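As an illustration for $n = 4$, the following Python sketch models a Fibonacci-style LFSR whose feedback is the XOR of stages 3 and 4. The tap positions, the seed, and the shift direction are assumptions chosen for this example and are not taken from the layout. The sketch then measures the period.

```python
from itertools import islice


def lfsr_patterns(seed=(1, 0, 0, 0), taps=(3, 4)):
    """Yield successive states (s_1, ..., s_n) of a Fibonacci-style LFSR.

    ASSUMPTION: feedback is the XOR of the tapped stages and is shifted
    into s_1; the seed must not be the all-0 pattern.
    """
    state = tuple(seed)
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = (feedback,) + state[:-1]   # shift by one stage


states = list(islice(lfsr_patterns(), 16))
period = states[1:].index(states[0]) + 1   # cycles until the seed recurs
print(period, len(set(states[:period])))
```

With these taps the register visits all 15 nonzero 4-bit patterns before repeating, so $T = 2^4 - 1$. Other values of $n$ need other tap positions.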
We show a detailed design of the pulse counter in Fig. 7 as an example. Each T1 counts its input pulses, and outputs a carry pulse every two input pulses. Once we feed a pulse to finish, the result preserved as the internal states of the T1s is obtained at the circuit outputs $z_n, z_{n-1}, \ldots, z_1$. To obtain the result, the $w_n$ signal in the bit generator can be used to feed a pulse to the finish terminal because it feeds a pulse every $T$ cycles.
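A behavioral sketch of such a counter in Python is given below. It assumes a ripple arrangement in which the carry of each T1 feeds din of the next stage and the finish pulse reads all stages at once. The exact wiring of Fig. 7 may differ, and the class names are illustrative.

```python
class T1:
    """One-bit pulse counter stage (behavioral model)."""

    def __init__(self):
        self.state = 0

    def din(self):
        carry = self.state   # carry pulse on every second input pulse
        self.state ^= 1      # ASSUMPTION: din toggles the internal state
        return carry

    def clk(self):
        total = self.state   # sum pulse if the stage holds a '1'
        self.state = 0       # ASSUMPTION: read-out clears the stage
        return total


class PulseCounter:
    """Chain of n T1 stages; stages[0] holds the least significant bit z_n."""

    def __init__(self, n):
        self.stages = [T1() for _ in range(n)]

    def pulse(self):
        for stage in self.stages:
            if not stage.din():   # stop when a stage produces no carry
                break

    def finish(self):
        bits = [stage.clk() for stage in self.stages]   # z_n first
        return sum(bit << k for k, bit in enumerate(bits))


counter = PulseCounter(4)
for _ in range(10):
    counter.pulse()
print(counter.finish())   # value accumulated during the period
```

Because the pulse count never exceeds $2^n - 1$, $n$ stages are sufficient and no overflow handling is modeled.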
Evaluation Results
We designed a layout of a 4-bit design of the proposed truncated multiplier. An LFSR was used as the pattern generator and the pulse counter shown in Sect. 3.3 was used in the layout. We used Cadence Virtuoso and the cell library designed for AIST advanced process (ADP2) [7] .
The layout of the multiplier is shown in Fig. 8. Its circuit area is 0.57 mm$^2$ (0.45 mm × 1.26 mm), and the number of Josephson junctions (JJs) is 996. The functionality of the circuit was verified by logic-level simulation with gate delays taken into account, using Cadence Verilog-XL. The circuit was estimated to work at frequencies up to 40 GHz. It outputs a multiplication result every 15 ($= 2^4 - 1$) cycles.
A bit-serial design of a multiplier was proposed in [4] . In [4] , integer multiplication is carried out internally using systolic processing elements (PEs). Each PE consumes 639 JJs, and 2,556 JJs are necessary for a 4-bit multiplier. Therefore, the number of JJs of the proposed design is smaller than that of the bit-serial design.
The error in the multiplication result of the 4-bit design was evaluated by numerical simulation. Its maximum absolute error is 1.0625 ulp. When the multiplication is carried out exactly and its 4-bit result is derived by rounding toward 0, i.e., by simply cutting the lower bits of the true multiplication result, the maximum absolute error is 0.9375 ulp. The error of the proposed multiplier is therefore only 0.125 ulp larger than that of simple truncation. A 4-bit data type is now supported in state-of-the-art semiconductor GPUs [6] for machine-learning applications using neural networks. The proposed multiplier would be useful for such applications.
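The two error figures can be reproduced with a short exhaustive search in Python. This is a sketch of one possible numerical evaluation, not the simulation actually used. It assumes that the multiplier output equals the pulse count derived in Sect. 3.2, and the function name `truncated_count` is illustrative.

```python
from fractions import Fraction


def truncated_count(a, b, n):
    """Pulse count in ulp for integer operands a, b (X = a/2**n, Y = b/2**n)."""
    total = 0
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            bit = (a >> (n - j)) & (b >> (n - i)) & 1   # x_j AND y_i
            if i + j <= n:
                total += bit << (n - (i + j))
            elif i + j == n + 1:
                total += bit
    return total


n = 4
pairs = [(a, b) for a in range(2**n) for b in range(2**n)]

# Error of the proposed multiplier against the exact product, in ulp.
max_proposed = max(abs(truncated_count(a, b, n) - Fraction(a * b, 2**n))
                   for a, b in pairs)

# Error of the exact product rounded toward 0 to n bits, in ulp.
max_floor = max(Fraction(a * b, 2**n) - (a * b >> n) for a, b in pairs)

print(float(max_proposed), float(max_floor))   # 1.0625 0.9375
```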
Conclusion
We proposed a truncated multiplier for RSFQ circuits. The multiplier transforms its binary operands into two serialized patterns of pulses, and its result is obtained by counting pulses of the same weight. Owing to the bit-level processing, it can be implemented in a small circuit area. The layout of a 4-bit design was shown, and it was estimated to work at frequencies up to 40 GHz.
