Phase-mode pipelined parallel multiplier by Onomi  Takeshi et al.
Phase-mode pipelined parallel multiplier
著者 Onomi  Takeshi, Yanagisawa  Kiyoshi, Seki 
Masashi, Nakajima  Koji
journal or
publication title







IEEE TRANSACTIONS ON APPLlhD SUPERCONDUCTIVITY, VOL. I I, NO. 1 ,  MARCH 2001 54 I 
P hase-Mode P ipelined Parallel Multiplier 
Takeshi Onomi, Kiyoshi Yanagisawa, Masashi Seki, and Koji Nakaj ima 
Abstract-We propose a pipelined parallel multiplier in 
phase-mode logic. The multiplier can be composed of  
combinations of gates which are the basic devices ofthe phase- 
mode logic. Experimental operations of the ICF gate and the 
Adder cell for the multiplier are reported. The proposed 
multiplier has a Wallace-tree structure comprising trees of 
carry save adders for the addition of partial products. This 
structure has a regular layout, hence it is suitable for a pipeline 
scheme. In the final stage of multiplication, a fast carry 
lookahead adder is used for generating a multiplication result. 
Using a Verilog-HDL simulation, we show that the parallel 
multiplier with 2.5kA/cm2 Nb/AlOx/Nb junctions can operate 
over 10 GHz. 
Index Term-multiplier, phase mode, pipeline, single flux 
quantum, 
I. INTRODUCTION 
ECENTLY, requirements of high-speed computation are 
increasing on the various areas such as nuclear science, 
weather forecasting, information processing of 
communication systems, molecular science, development of 
complex computer systems, etc. A single-flux-quantum (SFQ) 
logic has a great potential for such high-speed digital 
computations [1]-[3]. We have proposed phase-mode logic, 
which is an SFQ logic for a digital computation. 
In this paper, we propose a pipelined parallel multiplier in 
phase-mode logic. Firstly, we report experimental results of 
the basic gate of phase-mode logic and the adder cell which are 
used for designing the multiplier. Secondly, a design of a 
cany lookahead (CLA) adder is described. This CLA adder 
plays an important role in the final stage of multiplication. 
Thirdly, we propose a multiplier having trees of cany save 
adders for the addition of partial products. This structure has a 
regular layout, hence it is suitable for a pipeline scheme. 
Finally, we discuss the performance of the multiplier. Using a 
Verilog-HDL simulation, we show that the parallel multiplier 
with 2.5kA/cm2 Nb/AlOx/Nb junctions can operate over 10 
GHz. 
R 
11. ICF GATE AND ADDER CELL 
In this section, we describe the experimental test results of 
the basic gate and the adder cell which are the most basic 
Manuscript received Sept. 18,2000. This work was supported in part by 
Special Coordination Funds for promoting Science and Technology of the 
Science and Technology Agency of the Japanese Government. 
T.  Onomi is with L.E.I.S., Research Institute of Electrical Communication, 
Tohoku University, Sendai 980-8577 Japan (telephone: +81-22-217-5559, 
e-mail: onomi@riec.tohoku.acjp). 
K. Yanagisawa was with L.E.I.S., Research Institute of Electrical 
Communication, Tohoku University, Sendai 980-8577 Japan. He is now with 
Nippon Signal Co., (telephone: +81-48-859-2921, e-mail: yanagi- 
k@signal.co.jp). 
K. Nakajima is with L.E.I.S.. Research Institute of Electrical 
Communication, Tohoku University, Sendai 980-8577 Japan (telephone: 
+SI -22-21 7-5558, e-mail: hello@riec.tohoku.ac.jp). 
circuits of pipelined parallel adder and multiplier. 
A. ICF (INHIBIT controlled by fluxon) gate 
We have proposed ICF(INHIB1T controlled by fluxon) gate 
as the basic device of the phase-mode logic. Since the first 
proposal, some types of ICF gates have been proposed. In this 
section, we describe the ICF gate designed by RSFQ scheme. 
Fig. 1 shows the ICF gate and its Moore diagram. This gate 
has the same structure of D Flip-Flop with complementary 
outputs in RSFQ logic [4]. However, some modifications of 
the D Flip-Flop are needed to realize the perfect operation of 
the ICF gate. The modifications are the addition of Reset and 
C terminals. The operations of ICF gate are as follows: *If 
the internal state of the gate is “O”, that is, no SFQ is in the 
loop including 52, 53, 57, L2, and 56, an SFQ from X outputs 
to A terminal through 51,  52, 57, and 55. *If the internal state 
is “O”, an SFQ from Y is stored in the loop consisting of 52, 
53, 57, L2, and 56 after passing through 5 1 1  and 57. *If the 
internal state is “l”, an SFQ from X is combined with the 
stored SFQ, and outputs to B by 51,  53, 54, and 58 switching. 
*If the internal state is “I”,  an SFQ from Re is combined with 
the stored SFQ, and outputs to C by 510 and 56 switching. *If 





Fig.1. ICF gate. (a) Circuit diagram. The device parameters are Ic l  = 
0.35mA,lc2=0.28mA,Ic3=0.34mA,Ic4=lc5=0.25mA,Ic6= 0.31mA, 
Ic7 = Ic8 = 0.20m.4, Ic9 = 0.27mA, IclO = 0.12mA, LI  = 0.8pH, L2 = 0.7pH, 
L3 = L4 = O.lpH, L5 = 4.0pH, L6 = I SpH, L7 = 3.lpH, L8 = 2.9pH, L9 = 
1.2pH,L10= 3.OpH,Ibl = 0.23mA,Ib2= 0.19mA,andIb3 =0.2ImA. (b) 
Moore diagram. (c) Symbol. 
Figs. 2(a) and (b) show the microphotograph and the 
functional test result of the ICF gate, respectively. The gate is 
fabricated by using NEC standard 2.5kA/cm2 Nb/AlOx/Nb 
process. This result shows the proper operation of ICF gate. 
However, the measured bias margin is very narrow due to the 
dispersion of circuit parameters 
B. Adder Cell 
A serial input adder cell is achieved by feeding A output of 
an ICF gate back to Y input. Fig. 3 shows the adder cell using 
an ICF gate. Fig. 4 shows the low-speed test result of a 
1051-X223/01$10.00 0 2001 IEEE 
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on May 20,2010 at 07:40:19 UTC from IEEE Xplore.  Restrictions apply. 
542 
fabricated adder cell. This result shows the proper operation of 
the adder cell. The measured bias margin is k7% which is 
lower than designed value(*39%). 
Fig. 2. ICF gate fabricated by NEC 2.5kAicm’ Nb/AlOx/Nb standard process. 
(a) Microphotograph. (b) Low-speed test result. Three lower traces show 
output voltages of the SFQ/DC(T-flip-flop) pulse detectors. 
,. 
REI IC (c) 
Fig. 3. Adder cell. (a) Circuit diagram. (b) Moore diagram. (c) Symbol. 
Fig. 4. Low-speed test result of the fabricated adder circuit 
111. CARRY LOOKAHEAD ADDER 
A carry lookahead (CLA) adder described in this section 
plays an important role in a final stage of multiplication. In 
phase-mode logic, a parallel ripple carry adder has been 
proposed [I]. A ripple carry adder has the simplest structure, 
however it has a weak point of an increase of the operating 
time which is proportional to the number of bits. Most of 
parallel-adder circuits using semiconductor devices are based 
on a carry-lookahead architecture. In the RSFQ logic, a fast 
pipelined parallel CLA using only two types of cells (inverter 
and D-flip-flop) has been proposed [5]. A CLA adder has 
regularity in its structure and therefore it is suitable for a 
pipelined scheme. In this section, we describe a design of the 
cany lookahead adder using the ICF gates and adder cells 
described in the previous section. The CLA adder includes 
three arithmetical blocks which are Preprocessing, Carry 
Lookahead, and Postprocessing. In the following sections, the 
timing of the system is controlled by a traditional method of 
Phase-Mode logic [I] .  The system operates asynchronously by 
using one timing signal being attached to one word. This 
method has the disadvantage of relatively lower-speed 
operation than synchronized circuits. However, this method is 
simple and does not need much attention to timing design. 
Timing signals to each of the bits on a pipeline stage are 
provided by signal distribution trees. 
A. Preprocessing 
A(a,a,-~ *..a2al), B(b,b,.l .b2b1), and C(C,C.~ .--CZCI) denote N 
bits augend, addend, and carry, respectively. The 
Preprocessing block generates the P(propagate) signal and the 
G(generate) signal represented by following equations 
Sf = a, ’ b, (1) 
p ,  = U ,  Q3 b, (or p ,  = U ,  + b,).  ( 2 )  
Namely, g,=I represents the generation of the carry c, , and 
p,=l represents the generation of the carry c, when c,-i= 1. P and 
G signals can be generated by a PG generator shown in Fig. 5. 
Q signal (q,=a,Ob,) in Fig.5 is used for generating Sum in the 
Postprocessing block. 
h, 
Fig. 5. PG generator. 
B. Carry Lookahead 
This block generates carry signals by using P and G signals. 
The carry ci is represented by equation 
c. 1 = g .  1 + p . .  1 c .  1-1’ ( 3 )  
In a block including continuous bits from b i t j  to bit i ,  P and 
G signals can be defined as p, j  and g,,. p l y  signifies that a cany 
will propagate from bit j to bit i. Similarly g,, denotes that a 
carry is generated in at least one of the bit positions fromj to i 
inclusive and propagated to bit position i. The carry c,=g, o can 
be calculated efficiently by using the operator (A) introduced 
by Brent and Kung [6] .  The operator is used: 
which is defined as 
p. ,  . = p .  
The calculations of P=P,. Pb and G=G,+Gb.P, can be achieved 
by using three ICF gates as shown in Fig.6. 
Various cany calculation methods using the A operator 
have been proposed. Fig.7 shows the some methods having a 
tradeoff between speed and number of circuit elements. 
Fig.7(a) shows the notations of the A operator cell and the 
dummy cell having ‘only the data shiR function. Fig.7@) and 
(c) are Sklansky’s method [7] and Kogge and Stone’s method 
[SI, respectively. These methods have small stages of pipeline, 
hence they can achieve high-speed carry calculations. On the 
other hand, Brent and Kung’s method [6] shown in Fig.7(d) 
(Pi:j  9gi:j 1 = (Pi:k+l, g i : k + l  ) A ( P k + l : , j ’ g k + l : j  (4) 
1.1 r : k + l P k + l : j  and gi:,j = g i : k + l  + Pi:k+lgk:j. (5) 
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on May 20,2010 at 07:40:19 UTC from IEEE Xplore.  Restrictions apply. 
543 
has small circuit elements. Table I shows the comparison of 
these methods in a 32-bit CLA. 
C . Postprocessing 
A sum of bit i (si) is obtained by equation 
si = pi 63 ci-, . (6 )  
Using q generated by Preprocessing, a sum can be calculated 
as ~i=qiOgi.~. This operation can be achieved by using Sum 
generator shown in Fig.8. 
Fig. 6.  A operator cell. 
Fig. 7. Computation methods of CLA. (a) Notation. A filled dot and a circle 
denote a A operator cell and a shift operator cell, respectively. (b) Sklansky's 
method. (c) Kogge-Stone's method. (d) Brent-Kung's method. 
TABLE I 
COMPARISON OF 32-s r~  CLA ALGORITHMS 
Scheme Deoth of tree 0 
Sklansky 5 80 80 
Kogge-Stone 5 129 31 
Brent-Kung 9 57 199 
Fig. 8. Sum generator. 
A. Generation of partial products 
An AND cell shown in Fig. 10 is used for generating a 
partial product of a multiplication. AND cells forms an AND 
array shown in Fig.11. The AND amy sends SFQs (partial 
products) to each of bit lines serially. 
IN A 
Fig. 10. AND cell. (a)Symbol of the cell using ICF gate. (b)Notation. 
(c)Moore diagram. 
A $  A A A ,  
AND Array 
Fig. 1 I .  AND array(4-bit) 
B. Addition of partial products 
We propose a multiplier with a tree structure using a carry 
save adder(CSA). The CSA can be realized by using a full 
adder circuit with an additional adder cell to store the carry 
output as shown in Fig.12. The CSA has usually three inputs 
and two outputs. The CSA proposed in this section can easily 
expand into the cell with more inputs as shown in Fig. 12(b). 
After a partial products input to the CSA, the result of the 
addition is sent to outputs by reset signal. Fig. 13 shows the 
CSA cell using an ICF gate. Fig.14 shows a 7-input CSA 
array which varies the addition of seven numbers into the one 
of three numbers. 
Fig. 12. (n,m)CSA. n and m denote input number and output number, 
respectively. (a) (3,2)CSA. (b) (7,3)CSA. 
(b) R K u r  ( c )  
Fig. 13. CSA cell. (a)Symbol of the cell using ICF gate. (b)Notation. 
(c)Moore diagram. 
IV. PIPELINED PARALLEL MULTIPLIER C . Generation of multiplication result 
After additions of partial products are executed in order, the 
addition of two numbers remains finally. Using the carry 
hokahead adder in the previous section, the multiplication 
Fig.15 shows an example of a 32x32-bit multiplier using 
this section, we describe a design ofthe multiplier with a 
Wdlm-tree structure [9]. This structure h a  a simple and 
regular layout therefore can easily comprise the combinations 
of ICF gates and adder cells in the previous section. The 
multiplier includes three arithmetical blocks which are a 
generation of partial products, an addition of partial products, 
and a generation of the multiplication result. 
be Obtained. 
(793) CSA. 
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on May 20,2010 at 07:40:19 UTC from IEEE Xplore.  Restrictions apply. 
544 
A-br 
A . b  




Fig. 14. 7-input CSA array 
1 2  CSA { :::“:11:111:11::I::I::I::’::p::l::l:~’~’’~ + 
Fig. 15. Scheme of 32-bit multiplier using 7-input CSA array. Fine horizontal 
lines show the partitions of the input of next CSA array. Thick horizontal lines 
show the partitions of the operation of the multiplication. The final stage of 
the CSA array is connected to the CLA adder explained in section 111. 
CLA 
v .  ESTIMATIONS OF MULTIPLICATION PERFORMANCE 
The delay time of CSA is related to a time O(n) being 
proportional to a number of serial inputs and a time O(log(n)) 
depending on bit number of the adder. Accordingly, the 
processing time of one-stage CSA is represented by 
where n is a number of inputs, m is a bit number of the adder, 
T,,  is a time interval between input pulses, and Tadd is a delay 
time of carry propagation per one bit. If the multiplier is 
designed by using a 3-input CSA array, the processing time 
per stage of the pipeline is minimum, therefore, the 
throughput is maximum. However, the tree structure is large 
and complicated because of an increase of the number of 
pipeline stages. On the other hand, if the number of inputs is 
increased, the scale of trees can be reduced by the decrease of 
pipeline stages. However, the throughput is decreased. Namely, 
it means a tl-adeoff between operation speed and integration 
scale. While we can not expect the maximum throughput, the 
integration scale of CSA trees can be decreased by using a 7- 
input CSA. Table 11, Table I11 and Table IV shows the 
estimations of 32-bit multiplier without CLA block, 
integration scale, and estimations of Kogge-Stone CLA block, 
respectively. These estimations were carried out by Verilog- 
HDL simulations assuming 4 = 2.5kAicm2. A final 
multiplication result was obtained by a 64-bit Kogge-Stone 
CLA adder. The maximum throughput of the CLA is 
estimated to be over 15 GOPS by the Verilog-HDL simulation. 
Hence, total throughput of the multiplier is limited by the 
throughput of CSA trees shown in Table 11. As a result, a 32- 
T =  nTn +mTadd, (6) 
bit multiplication over 10 GHz can be achieved. 
TABLE I I 
ESTIMATIONS OF 32-BIT MULTIPLIER WITHOUT CLA BLOCK 
Throughput 
(GOPS : gigaoperation 
per second) 
number Maximum of input CSA Depth of the tree 
3 9 13.3 
7 5 6.7 
TABLE111 
INTEGRATION SCALE 
Block Number of ICF gates 
AND Array 1024 
32-input CSA Tree 561 
3-input CSA Tree 2541 
7-input CSA Tree 1110 
TABLE IV 
ESTIMATIONS OF KOGGE-STONE CLA BLCCK 










We have proposed a pipelined parallel multiplier using 
phase-mode logic. The multiplier is comprised of 
combinations of ICF gates which are the basic devices of the 
phase-mode logic. Experimental operations of the ICF gate 
and the Adder cell for the multiplier have been confirmed. The 
proposed multiplier has a Wallace-tree structure comprising 
trees of cany save adders for the addition of partial products. 
This structure has regularity in its layout, hence it is suitable 
for a pipelined scheme. On the final stage of multiplication, a 
fast cany lookahead adder is used for generating a 
multiplication result. Using a Verilog-HDL simulation, we 
have shown that the parallel multiplier with 2.5 W c m 2  
Nb/AlOx/Nb junctions can achieve fast multiplication over 10 
GHz. 
REFERENCES 
K. Nakajima, H. Mizusawa, H. Sugahara, and Y. Sawada, “Phase- 
mode Josephson computer System,” IEEE Trans. Appl. Superconduct., 
K. K. Likharev and V. K. Semenov, “RSFQ logichnemory family: A 
new Josephson-junction technology for sub-terahertz-clock-frequency 
digital systems,” IEEE Trans. Appl. Superconduct., vol. 1, pp.3-28, 
March 1991. 
M. Hosoya, W. Hioe, J. Casas, R. Kamikawai, Y. Harada, Y. Wada, H. 
Nakane, R. Suda, and E. Goto, “Quantum flux parametron: A single 
quantum flux device for Josephson supercomputer,” IEEE Trans. Appl. 
Superconduct., vol.1, pp.77-89, June 1991. 
A. F. Kirichenko, V. K. Semenov, Y. K. Kwong, and V. Nandakumar, 
“4-bit rapid single-flux-quantum decoder,” IEEE Trans. Appl. 
Superconduct,, vo1.5, pp.2857-2860, June 1995. 
P. Bunyk and P. Litskevitch, “Case study in RSFQ design: Fast pipelined 
Parallel Adder,” IEEE Trans. Appl. Superconduct., vo1.9, pp.3714-3720, 
June 1999. 
R. Brent and H. Kung, “A regular layout for parallel adders,” IEEE 
Trans. Comput., v0l.C-31, pp.260-264, March 1982. 
J. Sklansky, IRE Trans. Electronic Comput., vo1.9, p.226, 1960. 
P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient 
solution of a general class of recurrence equations,” IEEE Trans. 
Comput., v0l.C-22, pp.786-792, August 1973. 
C. S .  Wallace, “A suggestion for a fast multiplier,” IEEE Trans. 
Electron. Comp., vol. EC-13, pp. 14-17, Feb. 1964. 
~01.1, pp.29-36, March 1991. 
Authorized licensed use limited to: TOHOKU UNIVERSITY. Downloaded on May 20,2010 at 07:40:19 UTC from IEEE Xplore.  Restrictions apply. 
