20GHz Operation of an Asynchronous Wave-Pipelined RSFQ Arithmetic-Logic Unit  by Filippov, Timur V. et al.
 Physics Procedia  36 ( 2012 )  59 – 65 
1875-3892  © 2012 Published by Elsevier B.V. Selection and/or peer-review under responsibility of the Guest Editors. 
doi: 10.1016/j.phpro.2012.06.130 
Superconductivity Centennial Conference 
20 GHz operation of an asynchronous wave-pipelined RSFQ 
arithmetic-logic unit 
Timur V. Filippov (a,*), Anubhav Sahu (a), Alex F. Kirichenko (a), Igor V. Vernik (a), 
Mikhail Dorojevets (b), Christopher L. Ayala (b), and Oleg A. Mukhanov (a) 
(a) HYPRES, Inc., 175 Clearbrook Road, Elmsford, NY 10523, USA 
(b) Department of Electrical and Computer Engineering, Stony Brook University, Stony Brook, NY 11794, USA 
 
Abstract 
We have designed and tested at high frequency an RSFQ-based Arithmetic-Logic Unit (ALU), the critical component of an 8-bit 
RSFQ processor datapath. The ALU design is based on a Kogge-Stone adder and employs an asynchronous wave-pipelined 
approach scalable to wide-datapath processors. The 8-bit ALU circuit was fabricated with HYPRES’ standard 4.5 kA/cm² process 
and consists of 7,950 Josephson junctions, including input and output interfaces. In this paper, we present chip design and high-
speed test results for the 8-bit ALU circuit. 
 
© 2011 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of Horst Rogalla and Peter Kes. 
Keywords: RSFQ; ALU; microprocessor; datapath. 
 
1. Introduction 
A high-performance arithmetic-logic unit (ALU) is a fundamental building block for any special- or general-
purpose processor. The reported ALU is a key processing component for the RSFQ-based [1] 8-bit processor datapath 
[2]. This is the first attempt to build a superconductor parallel processor in contrast to the bit-serial approaches [3, 4]. 
The ALU design is based on a Kogge-Stone adder (KSA) [5]. A set of logic operations is integrated into the adder 
structure. The ALU is switched between arithmetic and logic operations by control signals. A similar approach to 
build an adder-based ALU was reported in [6]. However, that ALU was based on a simple ripple-carry adder and, 
therefore, was hardly scalable to a large number of bits. 
The current ALU employs a wave-pipeline synchronization approach [7]. According to this approach, a pipeline 
stage is allowed to start its operation on two independent data operands as soon as both operands arrive. There is no 
clock pulse used to advance the computation from one stage to another. Instead, a clock pulse that follows data is used 
to reset cells in the stage to make it ready to process the next data wave. This type of synchronization makes it 
different from the previous RSFQ-based pipeline ripple-carry adder [6] and KSA [8], where a co-flow timing 
technique was used to clock data throughout the entire adder requiring a clock distribution tree for every stage. 
We have already reported low-frequency functionality test results of the 8-bit ALU in [9]. This paper focuses on 
high-speed test results. 
 
* Timur V. Filippov, E-mail address: tfil@hypres.com 
Open access under CC BY-NC-ND license.
 
 
Fig. 1. An 8-bit ALU: (a) block-diagram; (b) microphotograph of the chip 
2. Design of the 8-bit ALU 
The block diagram of the 8-bit ALU is shown in Fig. 1a. The ALU consists of five stages built from four types of building 
blocks in accordance with the KSA algorithm [5]. All four blocks and their functions are listed in Fig. 2. The described ALU 
can be switched between arithmetic and logic operations by applying control signals to the INIT blocks. All control 
signals and corresponding instructions are listed in Table 1. 
While performing arithmetic operations, the INIT blocks produce bit-wise generate (G), propagate (P), and partial 
sum (pi) signals under control of Ready (R) pulses. The INIT blocks also route these signals to group prefix stages 
(three stages in the case of the 8-bit adder) formed by STR and REG blocks. The pi signals propagate only inside the 
bit-slice while G, P, and R pulses are also copied into other bit-slices in accordance with the KSA algorithm. The last 
stage of SUM blocks completes the summation. While executing logic operations, the INIT blocks perform bit-wise 
logic operations and route results to outputs inside the bit-slice through STR, REG and SUM blocks. 
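
The arithmetic data flow described above can be sketched in software. The following is a minimal behavioral sketch in Python, not a model of the RSFQ circuit or its timing; the function name kogge_stone_add and its variable names are chosen here for illustration only. It assumes the standard Kogge-Stone recurrences: bit-wise generate and propagate in the first stage, three group prefix stages with distances 1, 2 and 4 for 8 bits, and a final XOR of the partial sum with the carry from the previous bit-slice.

```python
def kogge_stone_add(a, b, width=8):
    """Return (sum mod 2**width, carry_out) using a Kogge-Stone prefix network."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask

    # First stage (the INIT role): bit-wise generate, propagate, partial sum.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    partial = p[:]  # p_i stays inside its own bit-slice until the last stage

    # Group prefix stages: distances 1, 2, 4 for width 8 (three stages).
    # Slices with i < dist only pass their values on; the others merge
    # their own (G, P) with the pair coming from slice i - dist.
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # Last stage (the SUM role): XOR partial sum with carry from slice i - 1.
    total = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        total |= (partial[i] ^ carry_in) << i
    return total, g[width - 1]


print(kogge_stone_add(127, 1))  # (128, 0)
print(kogge_stone_add(255, 1))  # (0, 1): carry ripples through all 8 bits
```

The second call exercises the longest carry path: the carry generated in bit 0 must reach the carry output through all three prefix stages.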
Fig. 3 shows schematics of the blocks. The most complex block (INIT) is shown in Fig. 3a. It consists of the 
following logic cells: a D flip-flop (D) [1], a dual-port D flip-flop (D2) [10], a D flip-flop with complementary outputs 
(DC) [1], an XOR cell [1], and a dynamic AND cell [11]. The top layer of the INIT block is formed by D flip-flops to 
store control signals ctrl_xor and ctrl_add, and XOR cells to store direct or inverted (depending on inv_a/inv_b 
signals) input data. All cells of the top layer are clocked by a Ready pulse. Then, P and G signals are calculated by 
XOR and AND cells of the second layer and buffered by D2 cells of the third layer. In the presence of ctrl_add signal, 
the results are sent by the R signal to the group prefix stages. Note that G, P, and R signals are split into two copies to 
propagate inside the bit-slice (v) and be routed to the other bit-slice (h). The P signal is replicated as the pi signal for 
propagation inside the bit-slice. 
 
 
Fig. 2. ALU building block symbols and their functions 
Table 1. ALU instruction set 

ALU Operation         ctrl_add  ctrl_xor  inv_a  inv_b 
ADD                   1         x         0      0 
ADD-Invert-A          1         x         1      0 
ADD-Invert-B          1         x         0      1 
ADD-Invert-A-and-B    1         x         1      1 
AND                   0         0         0      0 
NOR                   0         0         1      1 
Set all bits to “1”   0         0         1      1 
AND-Invert-A          0         0         1      0 
AND-Invert-B          0         0         0      1 
XOR                   0         1         0      0 
XNOR                  0         1         0      1 
NOP                   0         0         0      0 
 
In the presence of the ctrl_add signal, the ctrl_xor signal is ignored, which is why ctrl_xor is marked “x” (don't care) 
in Table 1 whenever ctrl_add is equal to 1. In the absence of the ctrl_add signal, the INIT block is controlled by the 
ctrl_xor signal, which switches between AND and XOR logic operations. This is done by two asynchronous AND cells in the 
fourth layer of the INIT block. In this case, the data are not routed into the group prefix stages but travel along the bit-slice. 
Note that the top layer of INIT block consisting of XOR cells inverts input data with the control signals inv_a and 
inv_b. This provides a subtraction operation and increases the variety of implemented instructions. 
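
The decoding of the four control signals in Table 1 can be illustrated with a short behavioral sketch in Python. The function name alu and the constant MASK are hypothetical names introduced here. The sketch assumes that the carry out of the most significant bit is simply dropped and that there is no carry-in, since neither is described in the text.

```python
MASK = 0xFF  # 8-bit datapath


def alu(a, b, ctrl_add, ctrl_xor, inv_a, inv_b):
    """Behavioral model of Table 1; returns the 8-bit result."""
    a &= MASK
    b &= MASK
    # Top layer of INIT: pass each operand directly or inverted.
    if inv_a:
        a ^= MASK
    if inv_b:
        b ^= MASK
    if ctrl_add:
        # ctrl_xor is a don't-care when ctrl_add is set.
        return (a + b) & MASK
    if ctrl_xor:
        return a ^ b
    return a & b


print(alu(101, 45, 1, 0, 0, 0))  # ADD -> 146
print(alu(101, 45, 0, 0, 0, 0))  # AND -> 37
print(alu(101, 45, 0, 1, 0, 0))  # XOR -> 72
```

In this model ADD-Invert-B returns A + NOT(B), which equals A - B - 1 modulo 256. How the remaining +1 of a two's-complement subtraction would be supplied is not stated in the text and is not modelled here.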
The STR block (Fig. 3b) is based on a resettable Muller C-element. Actually, this is a resettable asynchronous 
AND cell, implemented as a truncated version of a half-adder [12, 13]. The STR block receives signals from two bit-
slices and C-elements allow the STR block to start its operation on two independent data operands as soon as both of 
them arrive. All C-elements are reset by an R pulse produced by another C-element from two Ready signals received 
from different bit-slices. The R signal is also used for clocking the D-cell and triggering a pi signal in the case of 
arithmetic operation or the result of the logic operation. 
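
The role of the reset pulse in the STR block can be illustrated with a toy behavioral model in Python. The class ResettableCElement and its methods are hypothetical names; the model tracks only which input pulses have arrived in the current data wave and ignores all circuit-level timing.

```python
class ResettableCElement:
    """Toy model: emits one output pulse once both inputs have pulsed."""

    def __init__(self):
        self.seen_a = False
        self.seen_b = False

    def pulse(self, port):
        """Register a pulse on port 'a' or 'b'; return True if the output fires."""
        if port == "a":
            self.seen_a = True
        elif port == "b":
            self.seen_b = True
        else:
            raise ValueError("port must be 'a' or 'b'")
        if self.seen_a and self.seen_b:
            # Firing consumes both stored pulses.
            self.seen_a = self.seen_b = False
            return True
        return False

    def reset(self):
        """Clear a stored pulse, e.g. a lone '1' whose partner was '0'."""
        self.seen_a = self.seen_b = False


c = ResettableCElement()
# Wave 1: operands (1, 0). Only input 'a' pulses, so nothing fires.
fired_wave1 = c.pulse("a")
c.reset()  # the Ready-derived pulse that follows the data clears the cell
# Wave 2: operands (0, 1). Only input 'b' pulses.
fired_wave2 = c.pulse("b")
print(fired_wave1, fired_wave2)  # False False
```

Without the reset between the two waves, the pulse stored from wave 1 would combine with the pulse of wave 2 and produce a spurious output, which is why the reset must follow each data wave.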
The REG block (Fig. 3c) is a D-cell based block. It routes all input data out and maintains the same time delay per 
block by sending an input Ready pulse through the C-element in a way similar to how it is done in the STR block. 
The SUM block completes the addition operation by executing the XOR function between a partial sum pi signal 
of the current bit-slice and a carry signal from the previous bit-slice. During a logic operation, the SUM block simply 
passes its input data to the output. 
 
 
Fig. 3. Block-diagrams of ALU building blocks: (a) INIT; (b) STR; (c) REG; (d) SUM 
 
Fig. 4. Block-diagram of the ALU high-speed test chip (a) with input interface (b) based on SFQ relay (c) and output interface (d) to provide direct 
and complementary data streams 
Note that INIT, REG and SUM blocks are all clocked by Ready pulses while STR blocks employ Ready pulses 
only for resetting C-elements when data waves propagate through STR blocks. This feature, called wave-pipelined 
synchronization [7], is critical for STR blocks that can receive data from two remote bit-slices. 
All ALU stages are connected by means of passive transmission lines (PTLs). Each building block has PTL 
receivers at the input and transmitters at the output (up to 7 driver-receiver pairs per block). The difference in length 
between vertical (v) and horizontal (h) PTLs is compensated by extra JTLs placed behind the receiver in the next stage 
of the ALU. 
3. High Frequency Test Bed 
The 8-bit ALU was previously tested at low frequency (LF), with results reported in [9]. To perform high 
frequency (HF) tests, the ALU was integrated with a high-speed test environment. Fig. 4a shows the block-diagram of 
the HF test bed. 
The input interface is designed to provide data operands and control signals at high frequency. To achieve this, we 
generated data and control signals by applying clock (Ready) pulses to SFQ relays controlled by dc or low-frequency 
bias currents. Each INIT block has 6 relays: 2 for data channels and 4 for control signals (Fig. 4b). The SFQ relay 
(Fig. 4c) lets an SFQ pulse pass through when the control current is applied and blocks it when the current is off. 
By programming these relays, we effectively create any test pattern for any function working at the clock rate. In order 
to be able to observe the outputs for different combinations of relay settings with a low-frequency oscilloscope, we 
reprogrammed relays at a kHz rate. 
The input interface requires 20 pads for independent current sources to provide two 8-bit data words and a 4-bit 
instruction. A single pad is required for an HF clock source. The clock pulses are distributed between bit-slices by a 
PTL-based 1-to-8 splitter tree. 
The output interface is designed for bit-error rate testing at HF. The data from SUM blocks is converted to a 
complementary format (Fig. 4d). Then both (direct and complementary) data streams are amplified by toggling-type 
SFQ-to-dc converters [14]. A toggling SFQ-to-dc converter switches its dc-voltage state between 0.0 mV and 0.5 mV 
and back every time an SFQ pulse arrives. On a low-speed oscilloscope, a 20 GHz output (steady “1”) appears as a 
single line at 0.25 mV (the average voltage between two SFQ-to-dc states toggling at 20 GHz), while the absence of 
output signal (steady “0”) appears either as 0.0 mV or 0.5 mV (Fig. 4a). By switching the relay control current, we 
vary input data and observe “eye-diagrams” [6, 15, 16] on the oscilloscope. 
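
The voltage levels seen on the slow oscilloscope follow directly from the toggling behavior. A minimal sketch in Python, assuming ideal 0.0 mV and 0.5 mV states and simple averaging over many clock periods (the function name average_output_mv is hypothetical):

```python
def average_output_mv(bits, v_low=0.0, v_high=0.5, start_high=False):
    """Average voltage (mV) of a toggling SFQ-to-dc converter over a bit sequence."""
    if not bits:
        raise ValueError("bits must be a non-empty sequence")
    state_high = start_high
    total = 0.0
    for bit in bits:
        if bit:  # each SFQ pulse toggles the output state
            state_high = not state_high
        total += v_high if state_high else v_low
    return total / len(bits)


print(average_output_mv([1] * 1000))                   # 0.25 (steady "1")
print(average_output_mv([0] * 1000))                   # 0.0  (steady "0")
print(average_output_mv([0] * 1000, start_high=True))  # 0.5  (steady "0")
```

A steady "0" leaves the converter in whichever state it was last in, which is why it can appear at either of the two levels.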
 
Fig. 5. Correct ADD operation of the ALU at 20 GHz: (a) “127+1/0”, (b) “255+1/0” (see text for details) 
For measuring bit-error rate (BER), we select a particular set of direct and complementary outputs for a fixed 
combination of all dc control currents, so as to provide all 0s at all chosen outputs. In this case, all traces on the 
oscilloscope are stable lines and an error manifests itself as an abrupt transition between dc voltage states of an SFQ-
to-dc converter. 
The output interface uses 16 pads for a true and a complementary 8-bit output word and an extra two pads to 
monitor the clock and decimated clock. The 8-bit ALU was integrated with input and output interfaces on a 1 × 1 cm² 
chip. The microphotograph of the ALU chip is shown in Fig. 1b. The 8-bit ALU itself, without input and output 
interfaces, occupies an area of 4480 μm × 5245 μm and utilizes 6973 Josephson junctions. The high-speed test bed 
adds 977 Josephson junctions. The total length of PTLs comprising ALU interconnections is 118 mm. The total 
number of PTL driver-receiver pairs involved in the ALU design is 170. The ALU has been fabricated with HYPRES’ 
standard 4.5 kA/cm² Nb process [17]. 
4. High Frequency Test Results 
The most delay-sensitive ALU operation is addition, in which a carry signal is generated and propagated 
along the ALU circuit. Fig. 5a illustrates the correct operation of the ALU performing the addition “127+1/0”, 
with operand A fixed at 127 and operand B alternating between 1 and 0. One can see that the 8-bit output switches 
correctly between 128 and 127 in accordance with changes of operand B, which is modulated by an LF (~100 kHz) control 
current applied to the corresponding relay of the input bit. 
 
 
Fig. 6. Correct AND operation of the ALU at 20 GHz: (a) “255 AND 255/0”, (b) “255 AND-Invert-B 255/0” (see text for details) 
 
 
Fig. 7. Correct operation of the ALU at 20 GHz: A+B, A AND B, A XOR B, A+B for A=101 and B=45 
The extreme case of addition “255+1/0” is shown in Fig. 5b. This illustrates the correct propagation of carry 
through the whole ALU circuit and is the most critical case of all ALU operations. We measured ±5% operating 
margins for dc bias currents at 20 GHz. 
Fig. 6 gives an example of a logic operation. The AND function is applied to constant operand A=255 and B 
operand alternating between 255 (all 1s) and 0 (all 0s). The correct output for the bit-wise AND function is shown in 
Fig. 6a. In Fig. 6b the bias control current for inv_b is applied, inverting the output result. 
In the previous examples, input operands were variable while the selected operation was fixed. Fig. 7 illustrates the 
opposite situation where different ALU functions are selected for two fixed operands. To do so, bias control currents 
for ctrl_add and ctrl_xor were varied at ~100 kHz, and a sequence of ADD, AND, XOR, and ADD operations was 
applied for A=101 and B=45. The addition gives the correct result of 146, and the logic operations also provide the correct 
output for all corresponding bits of the operands. 
Complete functionality for all logic and arithmetic operations was confirmed at a clock frequency of 20 GHz.  The 
typical duration of error-free operation was 20 minutes for different operands and functions performed by the ALU 
described above. The corresponding bit-error rate was estimated as ~10⁻¹⁴. 
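
The order of magnitude of this estimate can be checked with a short calculation. The sketch below, in Python, assumes that the bit-error rate is bounded by the reciprocal of the number of bits observed without error; the text does not state how the estimate was actually made, and the function name ber_upper_estimate is hypothetical.

```python
def ber_upper_estimate(clock_hz, duration_s, n_outputs=1):
    """Reciprocal of the number of bits observed error-free."""
    bits_observed = clock_hz * duration_s * n_outputs
    return 1.0 / bits_observed


clock_hz = 20e9        # 20 GHz clock
duration_s = 20 * 60   # 20 minutes of error-free operation

one_output = ber_upper_estimate(clock_hz, duration_s)
all_outputs = ber_upper_estimate(clock_hz, duration_s, n_outputs=8)
print(f"{one_output:.1e} {all_outputs:.1e}")
```

Whether one output or all eight outputs are counted, the result falls between 10⁻¹⁵ and 10⁻¹³, consistent with the quoted order of magnitude.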
The total dc bias current required by the ALU chip is 1.08 A. For the nominal bias voltage of 2.6 mV that gives 
~2.8 mW of power dissipation. The energy consumption of the ALU can be significantly improved by converting the 
existing design from standard RSFQ logic [1] to an energy-efficient version of SFQ logic [18], such as ERSFQ [19] 
that eliminates bias resistors responsible for the dominant static portion of power dissipation. In the ERSFQ approach, 
there is no static power dissipation, and the dynamic power dissipation for the above 8-bit ALU would be 45 μW at 
the 20 GHz data rate. Recently, an ERSFQ 8-bit parallel adder was successfully demonstrated at 20 GHz while 
dissipating 7.2 μW [20]. 
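
The two power figures for the ALU chip can be reproduced with a short calculation in Python. The static figure is the product of bias current and bias voltage given above. For the dynamic figure, the sketch assumes the commonly used estimate P = I_b · Φ0 · f, which is not written out in the text but reproduces the quoted value; the function names are hypothetical.

```python
PHI0 = 2.067833848e-15  # magnetic flux quantum in Wb (V*s)


def static_power_w(i_bias_a, v_bias_v):
    """Static dissipation of resistively biased RSFQ: P = I_b * V_b."""
    return i_bias_a * v_bias_v


def ersfq_dynamic_power_w(i_bias_a, clock_hz):
    """Assumed dynamic-power estimate for ERSFQ: P = I_b * Phi0 * f."""
    return i_bias_a * PHI0 * clock_hz


i_bias = 1.08    # A, total dc bias current of the ALU chip
v_bias = 2.6e-3  # V, nominal bias voltage
f_clk = 20e9     # Hz

print(f"static:  {static_power_w(i_bias, v_bias) * 1e3:.2f} mW")
print(f"dynamic: {ersfq_dynamic_power_w(i_bias, f_clk) * 1e6:.1f} uW")
```

Under these assumptions the static figure comes out at about 2.8 mW and the dynamic figure at about 45 μW, a reduction of roughly a factor of 60.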
5. Conclusion 
We designed, fabricated and successfully tested an 8-bit ALU at a clock frequency of 20 GHz. The ALU design is 
based on a Kogge-Stone adder with an integrated set of logic operations. The design employs a wave-pipelined 
synchronization approach. The integration of the ALU with the register file [2], comprising a matrix of memory cells 
with routing circuitry, will result in a complete processor datapath circuit. Such a register file is under testing and 
results will be reported elsewhere. 
Acknowledgements 
The authors would like to thank D. Amparo, D. Donnelly, R. Hunt, J. Vivalda, D. Yohannes, and S.K. Tolpygo of 
the HYPRES fabrication team for producing the chips. We would also like to thank M. Manheimer and S. Holmes for 
fruitful and stimulating discussions. The work is supported in part by DoD contract #W911NF-09-C-0036. 
References 
[1] Likharev KK, Semenov VK. RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz clock-frequency digital 
systems. IEEE Trans. Appl. Supercond. 1991; 1: 3-28. 
[2] Dorojevets M, Ayala C, Kasperek A. Data-flow microarchitecture for wide datapath RSFQ processors: Design study. IEEE Trans. Appl. 
Supercond. 2011; 21(3): 787-791. 
[3] Bunyk P, Leung M, Spargo J, Dorojevets M. FLUX-1 RSFQ microprocessor. IEEE Trans. Appl. Supercond. 2003; 13(1): 433-436. 
[4] Fujimaki A, Tanaka M, Yamada T, Yamanashi Y, Park H, Yoshikawa N. Bit-serial single flux quantum microprocessor CORE. IEICE 
Trans. Electron. 2008; E91-C: 342-349. 
[5] Kogge P, Stone HS. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Trans. Computers. 
1973; C-22(8): 786-793. 
[6] Kim JY, Kim S, Kang J. Construction of an RSFQ 4-bit ALU with half adder cells. IEEE Trans. Appl. Supercond. 2005; 15(1): 308-311. 
[7] Dorojevets M, Ayala C, Kasperek A. Development and evaluation of design techniques for high-performance wave-pipelined wide 
datapath RSFQ processors. In: Proc. of the 12th Intl Superconductive Electronics Conference (ISEC ’09), Fukuoka, Japan, June 16-19, 2009. 
[8] Bunyk P, Litskevitch P. Case study in RSFQ design: Fast pipelined 32-bit adder. IEEE Trans. Appl. Supercond. 1999; 9(2): 3714-3720. 
[9] Filippov T, Dorojevets M, Sahu A, Kirichenko A, Ayala C, Mukhanov O. 8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit. 
IEEE Trans. Appl. Supercond. 2011; 21(3): 847-851. 
[10] Polonsky SV, Semenov VK, Kirichenko AF. Single flux quantum B flip-flop and its possible applications. IEEE Trans. Appl. Supercond. 
1994; 4(1): 9-18. 
[11] Kaplan S, Kirichenko A, Mukhanov O, Sarwana S. A prescaler circuit for a superconductive time-to-digital converter. IEEE Trans. Appl. 
Supercond. 2001; 11(1): 513-516. 
[12] Kirichenko AF. Universal Delay-Insensitive Logic Cell. US patent 6,486,694 B1 (2002). 
[13] Filippov TV, Pflyuk SV, Semenov VK, Wikborg EB. Encoders and decimation filters for superconductor oversampling ADCs. IEEE 
Trans. Appl. Supercond. 2001; 11(1): 545-549. 
[14] Kaplunenko VK, Khabipov MI, Koshelets VP, Likharev KK, Mukhanov OA, Semenov VK, Serpuchenko IL, Vystavkin AN. 
Experimental study of the RSFQ logic elements. IEEE Trans. Magn. 1989; 25(2): 861-864. 
[15] Mukhanov OA. RSFQ 1024-bit shift register for acquisition memory. IEEE Trans. Appl. Supercond. 1993; 3(4): 3102-3113. 
[16] Kirichenko A, Sarwana S, Gupta D, Yohannes D. Superconductor digital receiver components. IEEE Trans. Appl. Supercond. 2005; 
15(2): 249-254. 
[17] HYPRES’ Design Rules available at http://www.hypres.com 
[18] Mukhanov OA. Energy-efficient Single Flux Quantum technology. IEEE Trans. Appl. Supercond. 2011; 21(3): 760-769. 
[19] Kirichenko DE, Sarwana S, Kirichenko AF. Zero static power dissipation biasing of RSFQ circuits. IEEE Trans. Appl. Supercond. 2011; 
21(3): 776-779. 
[20] Kirichenko A, Vernik I, Sarwana S. Zero-static power dissipation 8-bit adder. to be published. 
 
