SUMMARY A rapid single-flux-quantum (RSFQ) 4-bit bit-slice multiplier is proposed. A new systolic-like multiplication algorithm suitable for RSFQ implementation is developed. The multiplier is designed using the cell library for the AIST 10-kA/cm² 1.0-µm fabrication technology (ADP2). Concurrent flow clocking is used to obtain a fully pipelined RSFQ logic design. A 4n × 4n-bit multiplier consists of 2n + 17 stages. To verify the algorithm and the logic design, a physical layout of the 8 × 8-bit multiplier has been designed with a target operating frequency of 50 GHz and simulated. It consists of 21 stages and 11,488 Josephson junctions. The simulation results show correct operation up to 62.5 GHz.
Introduction
Superconducting rapid single-flux-quantum (RSFQ) circuit technology [1] is expected to be a next-generation circuit technology that enables ultra-high-speed computation with ultra-low power consumption [2]. With the progress of RSFQ fabrication process technology, it has become possible to realize an RSFQ LSI including tens of thousands of Josephson junctions (JJs) [3]. The increase in the number of wiring layers, combined with the passive transmission line (PTL) technology [4], further increases the circuit integration density.
An integer multiplier is one of the most important arithmetic circuits for high-speed processing. An RSFQ 4 × 4-bit and an 8 × 8-bit parallel multiplier have been designed and fabricated [5], [6]. For practical applications, multipliers with longer operand lengths, e.g., 24, 32, or more bits, are desired. An RSFQ 24 × 24-bit bit-serial multiplier based on the systolic algorithm proposed in [7] has been designed and fabricated as a component of a single-precision floating-point multiplier [8].
In general, an m × m-bit multiplication requires m A logic design of an RSFQ 32 × 32-bit parallel multiplier would consist of more than 70 thousand JJs using the cell library [9] for the AIST ADP2 fabrication process [10], which is difficult to implement on a single die. On the other hand, an RSFQ m × m-bit bit-serial multiplier would have unacceptably high latency for several applications. We think a bit-slice architecture can be a solution.
In this paper, we propose an RSFQ 4-bit bit-slice integer multiplier. The proposed multiplier is based on a newly developed systolic-like multiplication algorithm suitable for RSFQ implementation. A 4n × 4n-bit multiplier is mainly composed of n almost identical systolic cells. Its hardware complexity is much lower than that of a parallel multiplier. To the best of our knowledge, this is the first proposal of an RSFQ 4-bit bit-slice multiplier. Although we let the length of a bit-slice be four in this paper, we can design a bit-slice multiplier with any slice length in the same way.
In the proposed 4-bit bit-slice multiplier, each systolic cell generates four 4-bit slices of partial products, accumulates them with three 4-bit slices from the preceding cell through carry save additions, and produces three 4-bit slices to the succeeding cell. In an RSFQ logic design with concurrent flow clocking, each systolic cell is implemented as a 9-stage pipelined circuit. By overlapping the clock cycles for the latter 7 stages with those for the former 7 stages of the succeeding cell, a low latency is achieved.
To verify the proposed algorithm and the logic design, a physical layout of the 8 × 8-bit (i.e., n = 2) 4-bit bit-slice multiplier has been designed with a target operating frequency of 50 GHz using the cell library for the AIST 10-kA/cm² 1.0-µm fabrication technology (the AIST ADP2).
The remainder of this paper is organized as follows. In the next section, the algorithm and architecture, and RSFQ logic design details of the proposed 4-bit bit-slice multiplier are described. In Sect. 3, a layout and simulation results of the 8 × 8-bit multiplier are shown. Finally, Sect. 4 summarizes our findings and concludes the paper.
4-bit Bit-Slice Multiplier

Algorithm and Architecture
We consider a 4n × 4n-bit multiplication, Z = X × Y, where n is a natural number,
Each of the 4n-bit multiplicand X and multiplier Y is divided into n slices of 4 bits each. The n pairs of operand slices are input one by one starting from the least significant one. The 8n-bit product Z is in 2n 4-bit slices which are output one by one from the least significant one.
Copyright © 2016 The Institute of Electronics, Information and Communication Engineers
The multiplier performs unsigned or signed integer multiplication. For unsigned integer multiplication, X is multiplied by each bit of Y to generate the partial products, which are added to get Z. Unlike unsigned integer multiplication, signed integer multiplication requires inverting the partial product bits multiplied by the sign bits of X and Y according to the following formula [7]:
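The formula itself is not reproduced in this text. As a hedged reconstruction, the standard two's complement expansion below agrees with the correction terms $-2^{8n-1}$ and $2^{4n}$ mentioned later in connection with Eq. (1). The bit names $x_i$ and $y_j$ (bit $i$ of X and bit $j$ of Y, with bit $4n-1$ as the sign bit) are assumptions of this sketch, and the exact arrangement of terms in [7] may differ.

```latex
\begin{align}
Z ={}& x_{4n-1}\,y_{4n-1}\,2^{8n-2}
      + \sum_{i=0}^{4n-2}\sum_{j=0}^{4n-2} x_i\,y_j\,2^{i+j} \nonumber\\
    &+ \sum_{i=0}^{4n-2} \overline{x_i\,y_{4n-1}}\;2^{\,i+4n-1}
      + \sum_{j=0}^{4n-2} \overline{x_{4n-1}\,y_j}\;2^{\,j+4n-1}
      - 2^{8n-1} + 2^{4n}
\end{align}
```

Each complemented sum equals $2^{8n-2} - 2^{4n-1}$ minus the corresponding uncomplemented sum, so the two of them together contribute an excess of $2^{8n-1} - 2^{4n}$, which the last two terms cancel.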
Therefore, we need a control signal to generate partial products that differ from those of unsigned integer multiplication. The multiplier is based on a newly developed systolic-like algorithm. Figure 1(a) shows a block diagram of a 4n × 4n-bit 4-bit bit-slice multiplier. It is composed of n main cells and a final addition cell. As shown in Fig. 1(b), a main cell consists of a 'partial product generator (PPG)' and a '4-4 accumulator', as well as registers, 'Reg X1', 'Reg X2', and 'Reg Y', for keeping two slices of X and a slice of Y, and D flip-flops for keeping signals. The PPG for the most significant cell (cell n−1) is slightly different from the others in order to handle the sign in signed multiplication. That cell also has an additional register for the signal Sign, which indicates signed multiplication. The final addition cell consists of a 4-bit bit-slice 3-to-2 compressor and a 4-bit bit-slice adder.
A multiplication is carried out through 3n + 1 logical (systolic) clock cycles. (We use 'logical cycle' in order to explain our multiplication algorithm. It is different from the clock cycle of the RSFQ design appearing later.)
The signal Start is fed to cell 0 at logical cycle 0 and forwarded to the next cell every logical cycle. (We count the logical cycle from 0.) Therefore, it reaches cell i at cycle i (i = 0 to n − 1). The i-th pair of the operand slices, i.e., ($X_i$ (= [$x_{4i+3}\,x_{4i+2}\,x_{4i+1}\,x_{4i}$]), $Y_i$ (= [$y_{4i+3}\,y_{4i+2}\,y_{4i+1}\,y_{4i}$])), is input to the multiplier at cycle i. $X_i$ is fed to cell 0 and forwarded to the next cell every two cycles. $Y_i$ is set to Reg Y of cell i by the Start signal and is kept there. The signal Sign, which is 1 for signed multiplication, is fed to the multiplier at logical cycle n − 1 with the most significant operand slices, and is immediately split into two. One copy is fed to cell 0 and forwarded to the next cell every two cycles along with $X_{n-1}$. The other is sent to cell n−1 directly, and is set into the additional register by the Start signal. Sign is used for the bit-complementation and for generating the correction terms $-2^{8n-1}$ and $2^{4n}$ in Eq. (1).
The PPG of cell i generates four 4-bit slices of partial products at logical cycles 2i to 2i + n. At logical cycle 2i + j (j = 0 to n), it generates the four 4-bit slices shown in the square in Fig. 2. (Fig. 2 shows a dot diagram of the proposed multiplication. A dot represents a partial product bit, a bit of an intermediate result, or a final product bit.) Note that at this cycle, Reg X1 and Reg X2 hold $X_{j-1}$ and $X_j$, respectively. The PPG of each main cell generates four partial products (four rows of partial product dots in the same color in Fig. 2).
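The slice schedule described above can be checked with a small functional model. The sketch below is an interpretation, not the authors' design: it assumes that at step j the cell holding $Y_i$ combines each bit $y_{4i+k}$ with the four bits of X that land in product slice i + j, taken from the 8-bit window formed by $X_{j-1}$ and $X_j$. It covers unsigned multiplication only, ignores pipeline timing, and uses plain integer addition in place of the carry-save accumulation. All function names are hypothetical.

```python
def to_slices(value, n):
    """Split an unsigned 4n-bit value into n 4-bit slices, least significant first."""
    return [(value >> (4 * i)) & 0xF for i in range(n)]


def ppg_rows(x_slices, y_slice, j):
    """Four 4-bit partial-product slices of one cell at step j (assumed behaviour)."""
    n = len(x_slices)
    x_lo = x_slices[j - 1] if j >= 1 else 0  # Reg X1 holds X_{j-1} (zero before X_0)
    x_hi = x_slices[j] if j < n else 0       # Reg X2 holds X_j (zero after X_{n-1})
    window = (x_hi << 4) | x_lo              # bits 4(j-1) .. 4j+3 of X
    rows = []
    for k in range(4):
        y_bit = (y_slice >> k) & 1
        # Row k needs bits 4j-k .. 4j-k+3 of X, i.e. window bits 4-k .. 7-k.
        x_bits = (window >> (4 - k)) & 0xF
        rows.append(x_bits if y_bit else 0)
    return rows


def bit_slice_multiply(x, y, n):
    """Unsigned 4n x 4n-bit product built from per-cell, per-step 4-bit slices."""
    xs, ys = to_slices(x, n), to_slices(y, n)
    z = 0
    for i in range(n):              # cell i keeps Y_i
        for j in range(n + 1):      # steps j = 0 .. n, i.e. logical cycle 2i + j
            for row in ppg_rows(xs, ys[i], j):
                z += row << (4 * (i + j))   # all four rows belong to product slice i + j
    return z
```

For example, `bit_slice_multiply(0xB7, 0x5E, 2)` should agree with `0xB7 * 0x5E`. The extra step j = n is needed because the rows for the upper bits of $Y_i$ are shifted and spill into one more slice.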
The 4-4 accumulator of cell i sums up the four 4-bit partial product slices from the PPG and three 4-bit slices from cell i−1 (input from S in0, S in1, and S in2) through carry save additions, and produces three 4-bit slices (output to S out0, S out1, and S out2) at logical cycles i to 2i + n. Note that at logical cycles i to 2i − 1, there are no inputs from the PPG.
In the final addition cell, the 4-bit bit-slice 3-to-2 compressor sums up the three 4-bit slices from cell n−1 and produces two 4-bit slices, and then, the 4-bit bit-slice adder sums up these two 4-bit slices and produces a 4-bit slice of the final product at logical cycle n + 1 to 3n.
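The slice-by-slice behaviour of the final adder can be illustrated as follows. This is a minimal behavioural sketch, assuming the adder consumes one pair of 4-bit slices per step and holds the carry-out for the next pair; it uses integer addition inside a slice and does not model the parallel prefix structure or the pipeline. The function name is hypothetical.

```python
def slice_serial_add(a_slices, b_slices):
    """Add two streams of 4-bit slices (least significant first), carrying between slices."""
    carry = 0
    out = []
    for a, b in zip(a_slices, b_slices):
        total = a + b + carry
        out.append(total & 0xF)   # 4-bit slice of the result
        carry = total >> 4        # carry kept for the succeeding slice
    return out, carry
```

With the slices of 0xB7 and 0x5E, i.e. `[0x7, 0xB]` and `[0xE, 0x5]`, the function returns the slices of 0x15 plus a final carry, matching 0xB7 + 0x5E = 0x115.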
RSFQ Logic Design
A fully pipelined synchronous RSFQ logic design of the proposed multiplier using concurrent flow clocking is considered. Namely, each pipeline stage consists of a row of RSFQ clocked logic gates. The basic RSFQ logic gates and flip-flops: AND, XOR, NOT, DFF, and NDRO (nondestructive read-out), and wiring elements: JTL (Josephson transmission line), SPL (splitter), CB (confluence buffer), and PTL in the cell library [9] for the AIST ADP2 [10] are used.
Reg Y is implemented using four NDROs. Reg X1 and Reg X2 are implemented using DFFs. The PPG is implemented as a row of 16 AND gates. Figure 3 shows that a 4-4 accumulator consists of four 4-bit carry save adders, each of which is a row of four full adders (FAs). In the figure, the upper side is the LSB side and the lower side is the MSB side. The three 4-bit slices from the preceding cell are input from S in0, S in1, and S in2. The three 4-bit slices produced are output to S out0, S out1, and S out2. A full adder is implemented using two AND gates, two XOR gates, and a CB. DFFs are required to keep the carries from the most significant position of each slice and add them to the least significant position of the succeeding slice. Figure 4 shows a logic gate level circuit of a 4-4 accumulator with a PPG. As shown in the figure, a main cell consists of 9 stages. Since a full adder takes two stages, adjacent cells are offset by two stages. Namely, the clock cycles for the latter 7 stages of a cell overlap with those for the former 7 stages of the succeeding cell.
Figure 5 shows a logic gate level circuit of the final addition cell. The 3-to-2 compressor is nothing but a carry save adder, i.e., a row of four full adders. To reduce the latency of the 4-bit bit-slice adder, a Sklansky adder, which is a type of parallel prefix (or carry look-ahead) adder, is used. It consists of 6 pipeline stages. The delay in the feedback loop for the carry signal to the succeeding slice is minimized using the technique developed in [11].
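The full adder and the carry hand-over between slices can be sketched at the bit level. The gate decomposition below is an assumption: it is a common way to build a full adder from two XORs, two ANDs, and one merging element, and it matches the gate count given above, but the source does not spell out the netlist. The CB is modelled as an OR, which is valid here because its two inputs are never 1 at the same time. Function names are hypothetical.

```python
def full_adder(a, b, c):
    """Full adder from two XORs, two ANDs and a merge (assumed decomposition)."""
    p = a ^ b          # first XOR
    g = a & b          # first AND
    s = p ^ c          # second XOR gives the sum bit
    t = p & c          # second AND
    carry = g | t      # CB merge; g and t are mutually exclusive
    return s, carry


def carry_save_slices(a_slices, b_slices, c_slices):
    """3-to-2 compress three streams of 4-bit slices, least significant slice first.

    Returns (sum_slice, carry_slice) pairs plus the carry still held after the
    last slice, which belongs to bit 0 of the next carry slice.
    """
    held = 0                      # models the DFF keeping the MSB carry of a slice
    out = []
    for a, b, c in zip(a_slices, b_slices, c_slices):
        sum_slice, carries = 0, 0
        for k in range(4):
            s, cy = full_adder((a >> k) & 1, (b >> k) & 1, (c >> k) & 1)
            sum_slice |= s << k
            carries |= cy << k
        carry_slice = ((carries << 1) & 0xF) | held   # carries move up one bit
        held = carries >> 3                           # MSB carry waits for the next slice
        out.append((sum_slice, carry_slice))
    return out, held
```

Adding every sum slice and carry slice at its slice weight, plus the held carry at the next slice weight, recovers the sum of the three inputs.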
The 4n × 4n-bit multiplier consists of 2n + 17 stages in total: two stages for setting $Y_0$ to cell 0, 2n + 7 stages for the n main cells, and 8 stages for the final addition cell. The n pairs of multiplicand and multiplier slices are fed at the first to the n-th clock cycles, and the 2n slices of the resultant product are output at the (2n + 18)-th to (4n + 17)-th clock cycles. The latency for a 4n × 4n-bit multiplication is 4n + 17 clock cycles (plus circuit delay). Namely, the most significant slice of the resultant product is output 4n + 17 clock cycles after the input of the least significant slice pair of operands. The multiplier can carry out a 4n × 4n-bit multiplication every 2n clock cycles.
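The stage and cycle counts above can be tabulated with a short helper. This is only an arithmetic sketch of the stated figures; the function name and dictionary keys are hypothetical. The main-cell term is written as 9 + 2(n − 1), which follows from 9 stages per cell with adjacent cells offset by two stages and equals 2n + 7.

```python
def pipeline_timing(n):
    """Stage and clock-cycle counts of a 4n x 4n-bit multiplier, from the stated formulas."""
    setup_stages = 2                     # setting Y_0 to cell 0
    main_stages = 9 + 2 * (n - 1)        # n main cells, offset by two stages: 2n + 7
    final_stages = 8                     # final addition cell
    stages = setup_stages + main_stages + final_stages   # 2n + 17
    first_output_cycle = stages + 1                      # 2n + 18
    last_output_cycle = first_output_cycle + 2 * n - 1   # 4n + 17
    return {
        "stages": stages,
        "first_output_cycle": first_output_cycle,
        "last_output_cycle": last_output_cycle,
        "latency_cycles": last_output_cycle,
        "issue_interval_cycles": 2 * n,
    }
```

For n = 2 this gives 21 stages and a latency of 25 clock cycles, in agreement with the 8 × 8-bit design.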
Layout and Simulation of an 8 × 8-bit Multiplier
To verify the algorithm and the logic design, a physical layout of an 8 × 8-bit 4-bit bit-slice multiplier with SFQ-to-DC and DC-to-SFQ converters has been designed and simulated using the AIST ADP2 with a target operating frequency of 50 GHz. Figure 6 shows the entire layout. The 8 × 8-bit multiplier includes all the component blocks, i.e., the 4-4 accumulator, the PPG, the PPG for the most significant cell, the 3-to-2 compressor, and the bit-slice adder.
In order to reduce the latency, PTLs are used for the wiring between stages. The 8 × 8-bit 4-bit bit-slice multiplier consists of 21 stages and 11,488 JJs. It occupies an area of 5.3 × 2.6 mm². It has a bias current of 1,302 mA and a circuit delay of 460 ps at a bias voltage of 2.5 mV. Here, "circuit delay" is defined as the total delay of the logic gates from data-in to data-out, and has been estimated by static timing analysis. The latency is 20 ps/cycle × 25 cycles + 460 ps = 960 ps at 50 GHz.
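The latency figure is the clock-cycle count times the clock period, plus the circuit delay. The sketch below restates that arithmetic; the function and parameter names are hypothetical. Applying the 460 ps circuit delay at a clock frequency other than 50 GHz is an assumption of this example, not a result reported here.

```python
def latency_ps(cycles, clock_ghz, circuit_delay_ps):
    """Total latency in picoseconds: cycles x clock period + circuit delay."""
    period_ps = 1000.0 / clock_ghz   # 1 GHz corresponds to a 1000 ps period
    return cycles * period_ps + circuit_delay_ps
```

`latency_ps(25, 50, 460)` reproduces the 960 ps stated above for the 8 × 8-bit multiplier.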
We have simulated the designed circuit with Cadence Verilog-XL software using the behavioral models of the cells provided with the cell library. The simulation results show that the multiplier operates correctly up to 62.5 GHz. Figure 7 shows a correct simulation result.
Conclusion
An RSFQ 4-bit bit-slice multiplier based on a new systolic-like algorithm has been proposed. A logic design of a 4n × 4n-bit multiplier using the AIST ADP2 includes 2n + 17 stages. We have verified the algorithm and the logic design by making a physical design of an 8 × 8-bit multiplier.
Compared with the parallel architecture, the bit-slice approach reduces the circuit complexity and the hardware cost. We believe that bit-slice processing is a practical solution for multiplication with longer operand lengths. Although we have let the length of a bit-slice be four in this paper, we can design a bit-slice multiplier with any slice length in the same way. When we design an m × m-bit k-bit bit-slice multiplier, it will mainly consist of m/k cells, and the amount of hardware of each cell is proportional to $k^2$. Therefore, the amount of hardware of the multiplier will be $O(km)$ instead of $O(m^2)$.
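The hardware estimate follows from multiplying the number of cells by the per-cell cost. In the worked form below, $c$ is a hypothetical proportionality constant introduced only for illustration.

```latex
\[
\underbrace{\frac{m}{k}}_{\text{cells}} \times \underbrace{c\,k^{2}}_{\text{hardware per cell}}
  \;=\; c\,k\,m .
\]
```

For instance, with m = 32 and k = 4 there are 8 cells and the total scales as 128c, whereas a parallel design scales with $m^2 = 1024$. The two constants of proportionality need not be equal, so this indicates the trend rather than an exact ratio.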
