Introduction
Several advances in superconductor technology and design made over the last few years have established the foundation of the FLUX superconductor microprocessor project. Theoretical studies and experimental chip designs based on the Rapid Single Flux Quantum (RSFQ) logic have demonstrated the great potential of this logic for ultra-fast superconductor digital circuit implementation. 1, 2 Circuit-level computer-aided design (CAD) and testing tools have been developed to cope with the challenges of practical low- and medium-scale RSFQ chip design and testing. 3, 4, 5, 6 A new 4 kA/cm², 1.75-µm Nb/AlOx/Nb Josephson junction technology developed by TRW, Inc. 7 has allowed reproducible fabrication of relatively large chips containing tens of thousands of junctions with tolerable variations of their technological parameters. The final contribution has come from the architectural and design studies of the SPELL RSFQ processors for a petaflops-scale system within the HTMT project. 8, 9 With these opportunities and support from our DoD sponsors available, we decided to proceed with the design and fabrication of the first single-chip RSFQ microprocessor prototype, called FLUX. 10, 11 From the beginning, we considered the FLUX-1 chip to be a technology and design driver that would allow us to discover challenges and problems in real-world superconductor processor design. The major goals of the FLUX project were formulated as follows: VLSI-scale complexity, high performance with a processor clock rate close to 20 GHz, chip-to-chip communication at a 5-GHz rate, design scalability, testability, and tolerance to process imperfections. In the following sections, we present the architecture and implementation of the FLUX-1 microprocessor and show how and to what extent these objectives were achieved in its design.
Superconductor Nb tri-layer technology
A good match between the FLUX architecture, superconductor technology, and RSFQ logic is crucial to achieving our goals. Among the key technological factors that had strong impact on the FLUX-1 design were: the critical current density, number of metal layers, chip area, circuit speed, and interconnection density.
The complexity of any superconductor chip design depends on the number of Josephson junctions (JJs) that can be laid out on a given die area. In order to use TRW's current fabrication techniques and get an acceptable die yield, the maximum area of a FLUX chip is limited to ~140 mm². With the technology providing a critical current density of 4 kA/cm², each 1.75-µm junction has to be overdamped with an external shunt resistor of ~3.8 Ω in order to have a Stewart-McCumber parameter of ~1. Due to fabrication requirements, Josephson junctions and shunt resistors, while occupying different layers, cannot overlap each other. The ratio of junction area to shunt-resistor area is ~1:5 for the current technological process. A dramatic increase in chip density is expected in future technologies featuring ~0.3-µm junctions and critical current densities exceeding 150 kA/cm², when junctions become overdamped without any external shunt resistors. 12
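The shunt value quoted above can be cross-checked with the standard Stewart-McCumber relation, beta_c = 2*pi*Ic*R^2*C/Phi0. The following is only a rough sketch: the square junction shape and the specific capacitance of 60 fF/µm² are assumptions that do not come from the text, and the junction's own subgap resistance is neglected. Under these assumptions the result falls close to the ~3.8 Ω figure.

```python
import math

PHI0 = 2.067833848e-15  # magnetic flux quantum, in webers


def shunt_resistance(jc_ka_per_cm2, side_um, cap_ff_per_um2, beta_c=1.0):
    """Shunt resistance (ohms) that yields the requested Stewart-McCumber parameter.

    Assumes a square junction of the given side length (an assumption; the
    text only gives a 1.75-um junction size) and ignores the junction's
    intrinsic subgap resistance.
    """
    area_um2 = side_um ** 2
    # 1 kA/cm^2 = 1e3 A/cm^2 = 1e-5 A/um^2
    critical_current_a = jc_ka_per_cm2 * 1e3 * 1e-8 * area_um2
    capacitance_f = cap_ff_per_um2 * 1e-15 * area_um2
    return math.sqrt(beta_c * PHI0 / (2 * math.pi * critical_current_a * capacitance_f))


# 4 kA/cm^2 and 1.75 um are from the text; 60 fF/um^2 is an assumed value.
r_shunt = shunt_resistance(jc_ka_per_cm2=4.0, side_um=1.75, cap_ff_per_um2=60.0)
print(f"estimated shunt resistance: {r_shunt:.2f} ohm")
```

Because both the critical current and the capacitance scale with junction area, the required shunt resistance in this model scales inversely with area, which is one way to see why the result is sensitive to the assumed junction geometry.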
While shunt resistors occupy quite a large area compared to junctions, it is the passive Nb microstrip and strip transmission lines that consume most of the FLUX-1 chip area because of the required line widths and the relatively small number of metal layers (four including one ground plane) available in TRW's technology. 7 All FLUX gates use at least two metal layers, called TRIW and WIRA, plus the ground plane, leaving only one metal layer (WIRB) unused. This WIRB level alone, however, cannot be used for transmitting any signals. In order to be reliable, all gate-to-gate communication has to be done via microstrip/strip transmission lines with one 'wiring' metal (WIRB/TRIW) and one/two additional grounded planes, respectively. These strip and microstrip lines can safely cross each other, being shielded from each other by the grounded planes. Special transceivers working as 'impedance bridges' have to be placed between the FLUX gates, with their 1.9-Ω output resistance, and the transmission lines, with their ~4.6-Ω impedance. While the minimum wire width and spacing required by the topological design rules are 1.5 µm, the actual FLUX lines are much wider: 5 µm for the TRIW strip lines and 21 µm for the WIRB microstrip lines. While the drivers' output resistance and the line impedance are matched to each other, it is not possible to provide a perfectly matched load for transmission lines at the receiver's (termination) side due to the non-linear resistance of Josephson junctions. 13 This results in partial reflections at the receiver end of a transmission line, which limit both the maximum data transfer rate over passive transmission lines and their length. However, even with these limitations, internal FLUX signals (represented by ~0.5-mV high and ~4-ps wide pulses) can be transmitted ballistically with very low signal attenuation and dispersion at a rate of 20 GHz over relatively long (up to 3-4 mm) Nb transmission lines.
Signals travel over the strip and microstrip lines at velocities of ~ 130 and ~ 90 µm/ps, respectively. It takes approximately two clock cycles for a signal to cross the FLUX-1 chip in a horizontal direction, and three cycles in a vertical one.
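To put these velocities in perspective, the time of flight over the longest lines mentioned above (3-4 mm) can be compared with the 50-ps clock cycle. This is a simple illustrative calculation; the function and constant names are ours, and driver and receiver delays are deliberately left out.

```python
STRIP_VELOCITY_UM_PER_PS = 130.0       # strip lines, from the text
MICROSTRIP_VELOCITY_UM_PER_PS = 90.0   # microstrip lines, from the text
CYCLE_PS = 50.0                        # 20-GHz clock period


def flight_time_ps(length_mm, velocity_um_per_ps):
    """Propagation time over a passive line, ignoring driver/receiver delays."""
    return length_mm * 1000.0 / velocity_um_per_ps


for name, velocity in (("strip", STRIP_VELOCITY_UM_PER_PS),
                       ("microstrip", MICROSTRIP_VELOCITY_UM_PER_PS)):
    for length_mm in (3.0, 4.0):
        t = flight_time_ps(length_mm, velocity)
        print(f"{name:10s} {length_mm:.0f} mm: {t:5.1f} ps "
              f"({t / CYCLE_PS:.0%} of a cycle)")
```

Even before any gate delay is added, a 4-mm microstrip line takes up most of one cycle, which is consistent with a chip crossing costing two to three cycles.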
Rapid Single Flux Quantum Logic
The low-loss ballistic propagation of signals over transmission lines, together with extremely high speed and low power consumption, are the most attractive features of superconductor RSFQ circuits. At the same time, the use of pulses rather than voltage levels to code signal values prevents RSFQ design elements from being connected to each other by buses with shared wires. Instead, binary signal distribution trees built of pulse splitters have to be used, which leads to quite a large area being consumed by interconnect in RSFQ designs.
In the RSFQ logic, almost all Boolean functions can only be implemented with clocked gates. Theoretically, the maximum clock frequency of a pipelined RSFQ processor can be reached with only one logic gate (in other words, one level of logic) per pipeline stage. (Current CMOS microprocessors feature from 8 to 12 levels of logic per stage.) Thus, the maximum rate at which these clocked gates can operate imposes the upper limit on the FLUX-1 processor clock rate. For any clocked gate used in FLUX-1, the reciprocal of the sum of the gate's setup and hold times gives the maximum operating frequency (F_max) of the gate. Maximum speed is not the only requirement in the RSFQ gate design. RSFQ circuits must be operational with a given bit error rate and have relatively high tolerance to noise and technological process parameter variations. Table 1 shows key timing characteristics (accurate to ~10%) of major RSFQ gates, with the design of each of them optimized for a 20-GHz clock rate and noise margins of ~30%.

Table 1. Key timing characteristics of major RSFQ gates.
… .0 n/a n/a n/a
Line driver 2 3.0 n/a n/a n/a
Line receiver 2 4.0 n/a n/a n/a
Splitter 3 6.5 n/a n/a n/a
Merger 5 4.5 n/a n/a n/a
D flip-flop (DFF) 6 7.5 6. …

The use of built-in flip-flops in RSFQ logic gates allows a superconductor processor to be built with very fine-grain processing pipelines and to have a very high clock frequency. Unfortunately, due to the small amount of work done per cycle in such fine-grain RSFQ pipelines (it can be several times less than the amount of work done each cycle in current CMOS processors), they inevitably have to be long. Without special architectural support, it would be unrealistic to keep these pipelines filled with useful operations and achieve high sustained performance in superconductor processors.
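The relation between a clocked gate's setup and hold times and its maximum operating frequency, as stated above, can be written out directly. The timing values used below are purely hypothetical and are not taken from Table 1.

```python
def f_max_ghz(setup_ps, hold_ps):
    """Maximum operating frequency (GHz) of a clocked gate: 1 / (setup + hold).

    Times are in picoseconds; 1 / (1 ps) = 1000 GHz.
    """
    window_ps = setup_ps + hold_ps
    if window_ps <= 0:
        raise ValueError("setup + hold must be positive")
    return 1000.0 / window_ps


# Hypothetical gate: 20 ps setup, 15 ps hold (illustrative values only).
print(f"F_max = {f_max_ghz(20.0, 15.0):.1f} GHz")
```

In this model a gate can support a 20-GHz clock only if its setup and hold times together do not exceed 50 ps.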
Another challenge is that the read latencies even for small 16 x 32-bit memories or register files implemented with the RSFQ gates shown in Table 1 are several cycles long because of the time spent in the address and word decoders (built with splitters and mergers) and in the interconnect of these relatively large memory structures.
FLUX-1 architecture and organization
A new architectural design requiring localized data communication between processing elements and registers has been developed in order to address the realities of the current superconductor technology and RSFQ logic. Figure 1 shows the block diagram of the FLUX-1 microprocessor. The major design units are: the 16-word instruction memory with the embedded program counter and instruction fetch logic; the branch unit with five condition flags; the instruction register and dual decode/issue logic; eight bit-stream arithmetic logic units (ALUs) interleaved with eight 8-bit general-purpose integer registers (R0-R7); two 8-bit 5-GHz I/O ports; the clock controller; and built-in scan path circuitry. The FLUX-1 instruction set consists of ~25 30-bit instructions, including conditional/unconditional branches, integer add/subtract, shift left/right, swap, or, invert, load an immediate value, etc. From a programmer's point of view, each bit-stream ALU operation has a two-register format, with the requirement that its source registers be adjacent; its destination register can be one of the source registers. After translation into binary code, each ALU operation is represented by a code consisting of a 2-bit ALU identifier and a 10-bit horizontal code field specifying the ALU operation to be performed on the registers 'neighboring' the ALU.
The 8-bit FLUX-1 microprocessor implements a new parallel architecture that extends the synchronous dual-op Long-Instruction-Word (LIW) architecture with stream-like processing of data bits in integer registers. An 8-bit word in any of the registers is treated as a vector of eight single-bit elements starting from the least significant bit 0 (LSB) and ending with the most significant bit 7 (MSB). Each register bit has its own read and write ports. Each bit-stream ALU is built with eight single-bit ALUs. A single-bit ALU (bALU) has a 3-stage execution pipeline whose data inputs and outputs are connected by relatively short (~500 µm) wires to the corresponding read and write bit ports of the registers located to the left and right of the bALU. Due to the short distances between ALUs and their source/destination registers, the read and pre-process microoperations as well as the write microoperations on each bit can be completed within one 50-ps cycle.
Two integer operations can be issued to two of the eight bit-stream ALUs and two other operations completed each cycle. In more detail, two operations of an instruction fetched from the instruction memory are placed into two slots of the instruction register. After two cycles spent in the first level of the pipelined instruction decoder, an ALU operation from each slot arrives at the ALU decoder specified by the operation's ALU-id field. Two cycles later, the pipelined ALU decoder initiates the specified register read and ALU operations by issuing properly delayed control signals to its least significant bALU and two registers surrounding it. Pipeline control logic embedded in each register bit and bALU transmits control, carry and data signals from one bit-stage to another until the completion of the operation. Four cycles after the decoded integer operation reaches the target ALU, the result's first bit (LSB) is calculated. (A FLUX processor with 16 or even 32 bits per register would have the same start-up delay of four cycles, which means that the FLUX-1 processing core design is perfectly scalable.) During the next eight cycles, 8 bits of the result (starting from LSB) will be written bit by bit into the operation's destination register by the eight single-bit bALUs comprising the target ALU. As a side effect of the operation on register R4, three control flags can be calculated and used in branch operations later.
What makes FLUX-1 different from any bit-sequential processor is that an operation dependent on data being calculated by some operation-in-progress can start working with the data as soon as its first bit is ready, i.e., without waiting for the operation-in-progress to complete. Each bit-stream ALU can overlap execution of up to 12 independent or 3 serially-dependent instructions working with the same 8-bit integer register.
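The idea of a dependent operation consuming result bits as they appear can be illustrated with a purely behavioural sketch of LSB-first bit-stream addition. This is not the FLUX-1 implementation: it does not model the 3-stage bALU pipeline, the start-up delay, or any timing, and all names are hypothetical. Python generators stand in for the bit-by-bit hand-off between operations.

```python
from typing import Iterable, Iterator, List


def to_bits(value: int, width: int = 8) -> List[int]:
    """Split an integer into bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: Iterable[int]) -> int:
    """Reassemble an integer from an LSB-first bit stream."""
    return sum(bit << i for i, bit in enumerate(bits))


def stream_add(a_bits: Iterable[int], b_bits: Iterable[int]) -> Iterator[int]:
    """Bit-serial adder: emits each sum bit as soon as its inputs are available."""
    carry = 0
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        yield total & 1        # sum bit for this position
        carry = total >> 1     # carry travels to the next bit position


# (23 + 42) + 100: the second add pulls bits from the first one lazily,
# so it starts on bit 0 before the first add has produced bit 7.
first = stream_add(to_bits(23), to_bits(42))
second = stream_add(first, to_bits(100))
result = from_bits(second)
print(result)
```

The carry out of bit 7 is dropped here, so results wrap modulo 256, as one would expect for 8-bit registers.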
Synchronous LIW architectures often suffer from increase in code size (so-called code inflation) caused by the no-operation (NOP) instructions. This problem is solved in FLUX-1 by suppressing the NOP instructions using a special 4-bit fetch delay field in each instruction. With a corresponding encoding of this field, the processor can postpone fetching of the instruction following the current one for the specified number of cycles (up to four), during which the FLUX hardware creates and inserts the NOP instructions into the processor pipeline. Table 2 presents architectural parameters of a FLUX-1 microprocessor. A different technique, wave pipelining, is used to implement the instruction memory whose read latency is approximately three times larger than the clock cycle time of 50 ps. Wave pipelining is used in modern CMOS design to improve processor cycle time by pipelining functional units and caches without using intermediate pipeline latches with their additional synchronous clocking overhead resulting from flip-flop delay, setup time, and clock jitter/skew. The FLUX-1 instruction memory is pipelined across three stages with three waves of pulses propagating simultaneously one by one over non-clocked splitters/mergers and transmission lines in the memory address decoder. This allowed us to provide the memory bandwidth of one (sequential) dual-op instruction per cycle.
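The NOP-suppression scheme described above can be sketched as a simple code transformation: runs of NOPs are folded into a delay count attached to the preceding instruction, and the hardware side re-creates them at fetch time. The sketch below is only an illustration under stated assumptions: the delay is kept as a small integer capped at four cycles (how the 4-bit field actually encodes it is not modelled), instructions are plain strings, and the function names are hypothetical.

```python
MAX_FETCH_DELAY = 4  # the text states a delay of up to four cycles


def suppress_nops(program):
    """Fold NOPs into the fetch-delay count of the preceding instruction.

    Returns a list of (instruction, fetch_delay) pairs. A NOP that cannot be
    folded (no predecessor, or predecessor already at the cap) stays explicit.
    """
    packed = []
    for instr in program:
        if instr == "NOP" and packed and packed[-1][1] < MAX_FETCH_DELAY:
            op, delay = packed[-1]
            packed[-1] = (op, delay + 1)
        else:
            packed.append((instr, 0))
    return packed


def expand(packed):
    """What the pipeline sees: each instruction followed by its generated NOPs."""
    stream = []
    for instr, delay in packed:
        stream.append(instr)
        stream.extend(["NOP"] * delay)
    return stream


program = ["ADD", "NOP", "NOP", "SUB", "NOP", "NOP", "NOP", "NOP", "NOP", "NOP", "BR"]
packed = suppress_nops(program)
print(packed)
print(f"{len(program)} words stored as {len(packed)}")
```

Expanding the packed form reproduces the original instruction stream, so the transformation changes code size but not the sequence of operations entering the pipeline.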
Besides the high-bandwidth memory, the FLUX-1 processor has very fast instruction select logic associated with its program counter (PC). In order to reach a fetch and issue rate of one instruction per cycle (in the absence of branches), the PC and its update circuitry were 'fused' with the instruction memory word (row line) decoder. The 16-bit PC is implemented as a shift register (with additional inputs from the branch unit and scan path logic) holding a 'linear' 16-bit address with only one non-zero bit, which corresponds to the position of the target instruction word in the memory arrays (so-called 'one-hot' encoding). As a result, the next PC address calculation can be done very quickly by shifting the contents of the PC by one bit. For each conditional/unconditional branch operation, the branch unit converts the binary 4-bit target address encoded in the instruction into the linear 16-bit target address. Currently, the FLUX-1 cycle time is evaluated as ~54.5 ± 10% ps. Figure 2 shows a FLUX-1 chip photomicrograph. The chip is designed using a full custom design flow. The development of a new RSFQ gate library 14 designed for TRW's technological process was accomplished using the SUNY circuit-level design tools, 3, 4, 5 as well as TRW technological libraries, all integrated with the Cadence CAD tools. At the gate level, FLUX-1 is built with a small set of basic building blocks, such as a one-bit ALU and register block, a 2 x 2 memory sub-array, etc. Each of these components is designed as a custom circuit macro optimized for high performance and small footprint. Placement and wiring of all elements (chip layout) are done by hand, with neither synthesis nor automatic place and route tools used.
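The one-hot program counter described above lends itself to a short behavioural sketch: sequential fetch is a one-bit shift, and a branch converts a 4-bit binary target into the 16-bit linear form. The function names are hypothetical, and the wrap-around from word 15 back to word 0 is an assumption of the sketch, not something stated in the text.

```python
WORDS = 16                    # instruction memory size, from the text
MASK = (1 << WORDS) - 1


def binary_to_one_hot(addr4):
    """Branch-unit conversion: 4-bit binary address -> 16-bit one-hot address."""
    if not 0 <= addr4 < WORDS:
        raise ValueError("address must fit in 4 bits")
    return 1 << addr4


def next_pc(pc):
    """Sequential fetch: shift the single set bit by one position.

    Wrapping from word 15 to word 0 is an assumption made for this sketch.
    """
    return ((pc << 1) | (pc >> (WORDS - 1))) & MASK


def selected_word(pc):
    """Index of the memory word whose row line the one-hot PC drives."""
    if bin(pc).count("1") != 1:
        raise ValueError("PC must have exactly one bit set")
    return pc.bit_length() - 1


pc = binary_to_one_hot(3)     # branch to word 3
pc = next_pc(pc)              # fall through to word 4
print(f"{pc:016b} -> word {selected_word(pc)}")
```

Because the PC already has the form of a row-select vector, no binary decoder sits between the PC and the memory word lines on the sequential path; only branch targets need the binary-to-one-hot conversion.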
Physical chip design
All FLUX-1 blocks operate synchronously with their clock supplied by the clock controller. The controller has several test ports (10 I/O pads total) connected to the external test equipment, allowing the processor clock frequency to be varied over a wide range, from fractions of a hertz in an external clock mode to more than 20 GHz in a normal mode that uses an on-chip clock generator with a programmable number of clock pulses to be generated. This clock is distributed throughout the chip using multiple-stage binary trees built of splitters.
Even if these trees were perfectly balanced (which is not the case for FLUX-1), clock skew caused by thermal fluctuations and, to a lesser extent, technological process variations in these splitters would be quite large in any RSFQ processor with thousands of clocked gates. To solve the problem, FLUX-1 incorporates circuit-level clock delay adjustment capabilities. The clock controller drives four clock domain buffers associated with major blocks. These clock domain buffers are used for custom deskewing of clock signals. Each clock domain buffer is a Josephson transmission line (JTL) consisting of 20 JJs in series. The bias currents of these JTLs are controlled from the outside, allowing the delays of individual junctions in these clock domain buffers to be changed in order to create the desired inter-domain clock skew on the order of ~5 ps. The elimination of signals generated by remote gates from the FLUX-1 critical path allowed us to keep the contribution of data and clock skew relatively low (~8-10% of the cycle time).
The bias (DC) currents used for powering FLUX-1 RSFQ gates vary from ~0.2 mA for a transmission line driver/receiver up to ~1.7 mA for a TRS flip-flop. With an average bias current of ~0.1 mA per Josephson junction, the total bias current is ~7 A. The 8 DC pads available in a FLUX-1 chip distribute this current to eight 'power regions', each with an approximately equal number of junctions. The power dissipation of a FLUX-1 chip (without the power dissipated in the DC copper leads carrying the bias current into the cryostat) is the sum of the static power dissipated in bias resistors (~14 mW at 2 mV) and the dynamic power dissipated in shunt resistors when junctions switch. With less than one fourth of all FLUX junctions switched per cycle and with ~ 0.06 mW dissipated per switching, the dynamic power consumption does not exceed 1 mW at 20 GHz.
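The static part of this power budget follows from simple arithmetic on the figures quoted above, as the sketch below shows. The junction count and the per-pad current are estimates derived here from the stated averages (assuming the eight power regions draw roughly equal currents); they are not figures given in the text.

```python
TOTAL_BIAS_CURRENT_A = 7.0         # from the text
AVG_BIAS_PER_JUNCTION_A = 0.1e-3   # from the text
BIAS_VOLTAGE_V = 2e-3              # from the text
DC_PADS = 8                        # from the text

# Junction count implied by the average bias current (an estimate).
junction_estimate = TOTAL_BIAS_CURRENT_A / AVG_BIAS_PER_JUNCTION_A

# Static power dissipated in the bias network: P = I * V.
static_power_mw = TOTAL_BIAS_CURRENT_A * BIAS_VOLTAGE_V * 1e3

# Current per DC pad, assuming the power regions draw equal shares.
current_per_pad_a = TOTAL_BIAS_CURRENT_A / DC_PADS

print(f"implied junction count: ~{junction_estimate:,.0f}")
print(f"static power:           ~{static_power_mw:.0f} mW")
print(f"current per DC pad:     ~{current_per_pad_a:.3f} A")
```

The result reproduces the ~14 mW static figure and makes explicit that static dissipation, not junction switching, dominates the chip's power budget.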
FLUX-1 has full built-in testing (scan path) capabilities giving access to the contents of all registers and memory during testing. The chip has 10 test regions each with their separate scan path logic and I/O test ports. All bits of any register/memory word are connected to adjacent bits of the same or other registers/memory words, forming a sequential scan path of the region. Through these scan paths data can be shifted bit by bit into registers/memory and shifted out under control of clock signals generated by the test equipment. The 56 low-speed test pads available in FLUX-1 are used for scan path testing, clock deskewing, and I/O port tweaking. Table 3 summarizes physical characteristics of a FLUX-1 chip.
A cycle-accurate simulator was developed for the FLUX architecture verification, and partial VHDL simulation was used for functional verification of some FLUX-1 units. Full circuit-level verification of a FLUX-1 chip was not possible due to the current state of superconductor CAD tools and the huge computational requirements of such a task. Instead, a multiple-chip approach including design, fabrication, and testing of several intermediate chips of increasing complexity was used. Twelve such chips containing individual FLUX-1 gates and components were designed before the first FLUX-1 chip went to fabrication in June 2001, after full layout-versus-schematic (LVS) and design rule checking (DRC). To be successful, however, any future design of an RSFQ microprocessor will require much more detailed functional and circuit-level verification.
Current status
The first FLUX-1 'niobium' was delivered in August 2001. The final version of the FLUX-1 MCM carrier, capable of passing the required bias current without overheating the MCM's DC pads, was fabricated in December 2001. Currently, the processor is under testing at TRW, Inc. During testing, a cryogenic probe with a FLUX-1 chip flip-chip bonded to the MCM carrier is placed into a cryostat mounted on top of a liquid-helium storage Dewar. 15 A multi-channel Octopux system 16 connected to the probe is used to apply test vectors, read out and analyze test results, and control the bias current sources. In order to significantly decrease the bit error rate currently observed in the first FLUX-1 chips, we are considering the design and fabrication of next versions of the FLUX-1 chip with current recycling and other circuit-level improvements incorporated. In 2002, we start a FLUX-2 project whose goal is to demonstrate a multiple-chip testbed including floating-point and data storage RSFQ units with 20-GHz chip-to-chip communication over an MCM.
