Abstract-We propose a pipelined parallel multiplier in phase-mode logic. The multiplier can be composed of combinations of gates which are the basic devices of the phase-mode logic. Experimental operations of the ICF gate and the adder cell for the multiplier are reported. The proposed multiplier has a Wallace-tree structure comprising trees of carry save adders for the addition of partial products. This structure has a regular layout; hence it is suitable for a pipeline scheme. In the final stage of multiplication, a fast carry lookahead adder is used for generating the multiplication result. Using a Verilog-HDL simulation, we show that the parallel multiplier with 2.5 kA/cm² Nb/AlOx/Nb junctions can operate at over 10 GHz.
I. INTRODUCTION
RECENTLY, the demand for high-speed computation has been increasing in various areas such as nuclear science, weather forecasting, information processing in communication systems, molecular science, and the development of complex computer systems. Single-flux-quantum (SFQ) logic has great potential for such high-speed digital computation [1]-[3]. We have proposed phase-mode logic, which is an SFQ logic for digital computation.
In this paper, we propose a pipelined parallel multiplier in phase-mode logic. Firstly, we report experimental results for the basic gate of phase-mode logic and the adder cell, which are used for designing the multiplier. Secondly, a design of a carry lookahead (CLA) adder is described. This CLA adder plays an important role in the final stage of multiplication. Thirdly, we propose a multiplier having trees of carry save adders for the addition of partial products. This structure has a regular layout; hence it is suitable for a pipeline scheme. Finally, we discuss the performance of the multiplier. Using a Verilog-HDL simulation, we show that the parallel multiplier with 2.5 kA/cm² Nb/AlOx/Nb junctions can operate at over 10 GHz.
Figs. 2(a) and (b) show the microphotograph and the functional test result of the ICF gate, respectively. The gate was fabricated using the NEC standard 2.5 kA/cm² Nb/AlOx/Nb process. This result shows proper operation of the ICF gate. However, the measured bias margin is very narrow owing to the dispersion of circuit parameters.
B. Adder Cell
A serial-input adder cell is obtained by feeding the A output of an ICF gate back to its Y input. Fig. 3 shows the adder cell using an ICF gate. Fig. 4 shows the low-speed test result of a fabricated adder cell. This result shows proper operation of the adder cell. The measured bias margin is ±7%, which is lower than the designed value (±39%).
III. CARRY LOOKAHEAD ADDER
The carry lookahead (CLA) adder described in this section plays an important role in the final stage of multiplication. In phase-mode logic, a parallel ripple carry adder has been proposed [1]. A ripple carry adder has the simplest structure; however, its weak point is that the operating time increases in proportion to the number of bits. Most parallel-adder circuits using semiconductor devices are based on a carry-lookahead architecture. In RSFQ logic, a fast pipelined parallel CLA using only two types of cells (inverter and D-flip-flop) has been proposed [5]. A CLA adder has regularity in its structure and is therefore suitable for a pipelined scheme. In this section, we describe a design of the carry lookahead adder using the ICF gates and adder cells described in the previous section. The CLA adder includes three arithmetical blocks: Preprocessing, Carry Lookahead, and Postprocessing.
In the following sections, the timing of the system is controlled by the traditional method of phase-mode logic [1]. The system operates asynchronously, with one timing signal attached to each word. This method has the disadvantage of relatively lower-speed operation than synchronized circuits. However, it is simple and does not require much attention to timing design. Timing signals for each of the bits in a pipeline stage are provided by signal distribution trees.
A. Preprocessing
$A(a_N a_{N-1} \cdots a_2 a_1)$, $B(b_N b_{N-1} \cdots b_2 b_1)$, and $C(c_N c_{N-1} \cdots c_2 c_1)$ denote the N-bit augend, addend, and carry, respectively. The Preprocessing block generates the P (propagate) signal and the G (generate) signal represented by the following equations
B. Carry Lookahead
This block generates carry signals by using the P and G signals. The carry $c_i$ is represented by the equation
In a block including the consecutive bits from bit j to bit i, the P and G signals can be defined as $p_{i,j}$ and $g_{i,j}$. $p_{i,j}$ signifies that a carry will propagate from bit j to bit i. Similarly, $g_{i,j}$ denotes that a carry is generated in at least one of the bit positions from j to i inclusive and propagated to bit position i. The carry $c_i = g_{i,0}$ can be calculated efficiently by using the operator (A) introduced by Brent and Kung [6], which is defined as
The calculations of $P = P_a \cdot P_b$ and $G = G_a + G_b \cdot P_a$ can be achieved by using three ICF gates, as shown in Fig. 6.
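As a minimal functional sketch of the P and G calculations above (a bit-level model, not the ICF-gate circuit), the operator can be written as a function on (G, P) pairs and applied from the least significant bit upward to obtain every carry. The preprocessing definitions used here, g = a AND b and p = a XOR b, are the common textbook forms and are an assumption, since the Preprocessing equations are not reproduced in this text. The function names are hypothetical.

```python
def combine(hi, lo):
    """Merge a more significant block (G_a, P_a) with a less significant
    block (G_b, P_b): G = G_a + G_b * P_a, P = P_a * P_b."""
    g_a, p_a = hi
    g_b, p_b = lo
    return (g_a | (g_b & p_a), p_a & p_b)


def to_bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> i) & 1 for i in range(n)]


def carries(a_bits, b_bits):
    """Carry out of every bit position (LSB first), with carry-in 0."""
    # Assumed textbook preprocessing: g_i = a_i AND b_i, p_i = a_i XOR b_i.
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    out = []
    acc = None
    for pair in gp:
        # acc holds (G, P) for the block from bit 0 up to the current bit.
        acc = pair if acc is None else combine(pair, acc)
        out.append(acc[0])  # carry out of this bit is the block generate
    return out


def add(a, b, n):
    """n-bit addition built from the carries; returns an (n + 1)-bit result."""
    a_bits, b_bits = to_bits(a, n), to_bits(b, n)
    c = [0] + carries(a_bits, b_bits)  # c[i] is the carry into bit i
    s = [a_bits[i] ^ b_bits[i] ^ c[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(s)) + (c[n] << n)
```

Because the operator is associative, the serial loop in `carries` can be regrouped into a tree without changing the result, which is what makes the parallel schemes possible.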
Various carry calculation methods using the A operator have been proposed. Fig. 7 shows some methods that trade off speed against the number of circuit elements. Fig. 7(a) shows the notations of the A operator cell and the dummy cell, which has only the data shift function.
has small circuit elements. Table I shows the comparison of these methods in a 32-bit CLA.
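To make the tradeoff between speed and cell count concrete, the following sketch models one well-known arrangement of the A operator, the Kogge-Stone network, which is the CLA type named later in this paper. It is a textbook model only: the (G, P) inputs are assumed to come from preprocessing, the function names are hypothetical, and the counts it returns are properties of the textbook network, not the values reported in Table I.

```python
def combine(hi, lo):
    """A operator: G = G_a + G_b * P_a, P = P_a * P_b."""
    g_a, p_a = hi
    g_b, p_b = lo
    return (g_a | (g_b & p_a), p_a & p_b)


def kogge_stone(gp):
    """Parallel prefix over (g, p) pairs given LSB first.

    Returns (prefixes, levels, operator_cells), where prefixes[i] is the
    (G, P) pair of the block from bit 0 to bit i.
    """
    n = len(gp)
    cur = list(gp)
    levels = 0
    cells = 0
    dist = 1
    while dist < n:
        # Positions below dist are only passed on (the dummy cells).
        nxt = list(cur)
        for i in range(dist, n):
            nxt[i] = combine(cur[i], cur[i - dist])
            cells += 1
        cur = nxt
        dist *= 2
        levels += 1
    return cur, levels, cells
```

For 32 bits this model needs 5 levels and 129 operator cells. Schemes with fewer operator cells generally pay for the saving with more levels, which is the kind of tradeoff Table I compares.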
C. Postprocessing
The sum of bit i ($s_i$) is obtained by the equation
Using q generated by Preprocessing, a sum can be calculated as $s_i = q_i \oplus g_{i-1}$. This operation can be achieved by using the Sum generator shown in Fig. 8.
IV. PIPELINED PARALLEL MULTIPLIER
In this section, we describe a design of the multiplier with a Wallace-tree structure [9]. This structure has a simple and regular layout and can therefore easily be composed of combinations of the ICF gates and adder cells in the previous section. The multiplier includes three arithmetical blocks: generation of partial products, addition of partial products, and generation of the multiplication result.
A. Generation of partial products
An AND cell shown in Fig. 10 is used for generating a partial product of a multiplication. AND cells form the AND array shown in Fig. 11. The AND array sends SFQs (partial products) to each of the bit lines serially.
B. Addition of partial products
We propose a multiplier with a tree structure using a carry save adder (CSA). The CSA can be realized by using a full adder circuit with an additional adder cell to store the carry output, as shown in Fig. 12. A CSA usually has three inputs and two outputs. The CSA proposed in this section can easily be expanded into a cell with more inputs, as shown in Fig. 12(b).
After the partial products are input to the CSA, the result of the addition is sent to the outputs by a reset signal. Fig. 13 shows the CSA cell using an ICF gate. Fig. 14 shows a 7-input CSA array, which converts the addition of seven numbers into the addition of three numbers.
C. Generation of multiplication result
After the additions of partial products are executed in order, the addition of two numbers finally remains. Using the carry lookahead adder in the previous section, the multiplication result is generated. Fig. 15 shows an example of a 32x32-bit multiplier.
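The three blocks of the multiplier can be illustrated with a short functional model. This is a sketch under stated assumptions, not the circuit of Fig. 15: the grouping of operands into 7-input stages is one possible arrangement chosen for illustration, the final `sum` merely stands in for the carry lookahead adder, and all function names are hypothetical.

```python
def partial_products(x, y, n):
    """AND array: row i is x ANDed with bit i of y, at weight 2**i."""
    return [(x if (y >> i) & 1 else 0) << i for i in range(n)]


def counter_stage(nums, width):
    """Column-wise counter: k numbers in, k.bit_length() numbers out.

    For each bit column the ones are counted, and bit j of that count is
    written to output j at weight 2**(column + j), so the total is kept.
    Seven inputs therefore give three outputs.
    """
    outs = [0] * len(nums).bit_length()
    for col in range(width):
        ones = sum((v >> col) & 1 for v in nums)
        for j in range(len(outs)):
            outs[j] |= ((ones >> j) & 1) << (col + j)
    return outs


def reduce_tree(nums, width, fan_in=7):
    """Apply counter stages until at most two numbers remain."""
    stages = 0
    while len(nums) > 2:
        nxt = []
        for i in range(0, len(nums), fan_in):
            nxt.extend(counter_stage(nums[i:i + fan_in], width))
        nums = nxt
        stages += 1
    return nums, stages


def multiply(x, y, n=32):
    """Return (product, number of reduction stages) for n-bit operands."""
    rows = partial_products(x, y, n)
    rest, stages = reduce_tree(rows, 2 * n)
    # The last two-operand addition is where the CLA adder would be used.
    return sum(rest), stages
```

With this grouping, the 32 partial products of a 32x32-bit multiplication are reduced as 32, 15, 7, 3, 2, that is, in four stages of 7-input counters. The stage count of an actual design depends on how its tree is arranged.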
V. ESTIMATIONS OF MULTIPLICATION PERFORMANCE
The delay time of the CSA consists of a time O(n) that is proportional to the number of serial inputs and a time O(log(n)) that depends on the bit number of the adder. Accordingly, the processing time of a one-stage CSA is represented by
where n is the number of inputs, m is the bit number of the adder, $T_{in}$ is the time interval between input pulses, and $T_{add}$ is the delay time of carry propagation per bit.
If the multiplier is designed by using a 3-input CSA array, the processing time per stage of the pipeline is minimum, and therefore the throughput is maximum. However, the tree structure is large and complicated because the number of pipeline stages increases. On the other hand, if the number of inputs is increased, the scale of the trees can be reduced because fewer pipeline stages are needed. However, the throughput is decreased. That is, there is a tradeoff between operation speed and integration scale. Although we cannot expect the maximum throughput, the integration scale of the CSA trees can be decreased by using a 7-input CSA.
Tables II, III, and IV show the estimations for the 32-bit multiplier without the CLA block, the integration scale, and the estimations for the Kogge-Stone CLA block, respectively. These estimations were carried out by Verilog-HDL simulations assuming $J_c$ = 2.5 kA/cm². The final multiplication result was obtained by a 64-bit Kogge-Stone CLA adder. The maximum throughput of the CLA is estimated to be over 15 GOPS by the Verilog-HDL simulation.
Hence, the total throughput of the multiplier is limited by the throughput of the CSA trees shown in Table II. As a result, a 32-bit multiplication at over 10 GHz can be achieved.
VI. CONCLUSION
We have proposed a pipelined parallel multiplier using phase-mode logic. The multiplier is composed of combinations of ICF gates, which are the basic devices of the phase-mode logic. Experimental operations of the ICF gate and the adder cell for the multiplier have been confirmed. The proposed multiplier has a Wallace-tree structure comprising trees of carry save adders for the addition of partial products. This structure has regularity in its layout; hence it is suitable for a pipelined scheme. In the final stage of multiplication, a fast carry lookahead adder is used for generating the multiplication result. Using a Verilog-HDL simulation, we have shown that the parallel multiplier with 2.5 kA/cm² Nb/AlOx/Nb junctions can achieve fast multiplication at over 10 GHz.
