We present a novel asynchronous RSFQ digital circuit scheme, the Test-Timed (TT) RSFQ digital circuit and system. In this asynchronous approach, data are transferred in a delay-insensitive fashion, which avoids the overhead of global clock distribution and the associated timing uncertainty. In the scheme, the timing signal of a logic module is generated by a test logic module. The delay module required by previous asynchronous circuits can therefore be removed from our circuit. Finally, simulation results for a Test-Timed data processing pipeline based on the TT scheme are presented.
Introduction
With the growing demand for computational capability in many high-technology areas, high-performance computers remain important to the computer industry. However, according to an authoritative industrial forecast, a petaflops-scale computer built with standard CMOS technology would have a total power consumption on the order of 10 MW [1].
With gate delays on the order of picoseconds and extremely low power consumption, superconducting RSFQ digital circuits are a strong candidate for high-performance supercomputers. In an RSFQ digital circuit, data are represented by the presence (logic "1") or absence (logic "0") of a very short voltage pulse within a clock window. To date, considerable progress toward RSFQ-based high-performance computer systems has been made in projects such as HTMT [2] and FLUX [3].
If we construct RSFQ digital circuits directly from conventional synchronous circuit theory, we face many difficulties and achieve only a small performance improvement over modern supercomputers. An RSFQ circuit needs very accurate clock timing to produce a correct result. In an RSFQ digital circuit, a clock distribution tree over a 1-cm × 1-cm chip may accumulate timing uncertainty on the order of 100 ps [6]. During RSFQ chip design, much attention must therefore be paid to clock distribution; otherwise the circuit may realize an unintended logic function with considerable uncertainty. To overcome these difficulties in synchronous RSFQ circuit design, asynchronous RSFQ circuit structures have been adopted in recent years. An asynchronous RSFQ circuit is driven by handshake signals instead of a global clock. Notable results of this line of work include the PDDRL [4] [5] and DDST [6] [7] circuits. Both models remove the global clock and avoid the problems it causes. However, PDDRL requires twice as much hardware as an ordinary single-rail logic circuit, so its realization is limited by the low integration density of current superconducting RSFQ fabrication technology. In the DDST model, the designer must pay close attention to the delays of the logic cell and of the timing signal transmission: if the timing signal arrives at the DFFC unit before the data from the logic unit, the DFFC outputs wrong data.
This paper presents a delay-insensitive asynchronous RSFQ model, the Test-Timing (TT) model. With a proper combination of a Test-Timing cell and a logic cell, we can ensure the correct arrival order of the clock pulse and the data pulses, and hence correct operation.
Test-Timing Model
In synchronous RSFQ circuits, data are encoded by the presence or absence of an SFQ pulse within a timing window. The timing signal is the periodic signal from the global clock. The clock period must be longer than the largest cell delay in the circuit. For a timed RSFQ logic cell, the data pulses must arrive before the timing pulse; otherwise the result is unpredictable.
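The ordering constraint can be illustrated with a minimal Python sketch. It assumes an idealized clocked OR cell in which each pulse is represented only by its arrival time in picoseconds; the function name `timed_or` and the sample times are hypothetical and are not taken from any cell library or from the circuits in this paper.

```python
from typing import Optional


def timed_or(a_ps: Optional[float], b_ps: Optional[float], clk_ps: float) -> int:
    """Idealized clocked OR cell (illustrative assumption, not a circuit model).

    a_ps and b_ps are data-pulse arrival times; None means no pulse (logic 0).
    A data pulse is registered only if it arrives before the clock pulse.
    """
    a_seen = a_ps is not None and a_ps < clk_ps
    b_seen = b_ps is not None and b_ps < clk_ps
    return int(a_seen or b_seen)


# Data arrives before the clock: the cell reports the intended OR result.
on_time = timed_or(a_ps=1.0, b_ps=None, clk_ps=5.0)
# Same logical input, but the data pulse arrives after the clock and is missed.
late = timed_or(a_ps=6.0, b_ps=None, clk_ps=5.0)
print(on_time, late)
```

In this simplified picture a late data pulse is silently dropped, which is the failure that a correct arrival order is meant to exclude.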
In our TT model, we combine a timed logic cell with a modified timed logic cell to ensure that the data pulses arrive first. The modified logic cell has the same function as the timed one, but it outputs the result pulse as soon as the computation is completed, without needing an external timing clock. Fig. 1(a) is the circuit diagram of the TT combination. In Fig. 1(a), the test logic cell is the Test-Timing cell, which produces the timing signals, and the logic cell is the logic function cell, which carries out the logic computation. "At" and "Bt" are the test pulse inputs of the test logic cell, which produces the CLK pulse whenever the logic computation is completed. "Ai" and "Bi" are the data pulses. Whatever the values of "Ai" and "Bi", the test logic must output a CLK pulse, so the values of "At" and "Bt" must be chosen to meet this requirement. Usually we set "At" and "Bt" to logic "1" and modify the connections of the test logic according to the logic cell.
In this way, a CLK pulse appears at the test logic output port for every data input. The modified logic cell can easily be obtained by adding an asynchronous branch to the timed logic cell; an example is presented in [5]. The test logic cell can also be obtained by removing the timed output port and making a few changes to the circuit parameters of the timed logic cell.
The test logic cell and the logic cell implement the same logic function, so their computation delays are the same. When the test logic cell completes its computation and a CLK pulse appears, the logic cell must also have finished its computation, and its result is ready. The CLK pulse therefore produces the correct result pulse at the logic cell output port. Fig. 1(b) presents a TT-based OR cell. The upper part is the test OR cell [5]; the lower part is the timed OR cell, which carries out the OR logic function. When pulses arrive at the input ports "Ati", "Bti", "Ai", and "Bi", the test OR cell and the timed OR cell immediately carry out a logic OR operation. When the computation is completed, the test OR cell outputs a pulse at its asynchronous port right away. The timed OR cell stores the result in the J6-L6-J7 loop and waits for a pulse at the CLK port "L7-J8". After the logic computation delay of the OR cell, the result appears at the "or data output" port of the timed OR cell.
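The timing relation in the TT-based OR cell can be sketched at the event level as follows. This is a simplified illustration under stated assumptions, not the circuit of Fig. 1(b): both cells are assumed to have exactly the same delay, the value `CELL_DELAY_PS = 2.0` is an arbitrary illustrative number, and the names `TTResult` and `tt_or` are hypothetical.

```python
from dataclasses import dataclass

CELL_DELAY_PS = 2.0  # assumed cell delay, chosen only for illustration


@dataclass
class TTResult:
    value: int     # logic value released by the timed OR cell
    clk_ps: float  # time at which the test OR cell emits the CLK pulse
    out_ps: float  # time at which the timed OR cell releases its result


def tt_or(a: int, b: int, t_in_ps: float, delay_ps: float = CELL_DELAY_PS) -> TTResult:
    """Event-level sketch of a test OR cell paired with a timed OR cell."""
    # Test OR cell: its inputs are tied to logic "1", so it fires on every input event.
    clk_ps = t_in_ps + delay_ps
    # Timed OR cell: same logic function, so the same delay is assumed here;
    # the result is stored in the cell by this time.
    stored_ps = t_in_ps + delay_ps
    # The CLK pulse must never arrive before the result has been stored.
    assert stored_ps <= clk_ps, "CLK arrived before the stored result"
    # The stored result is released one cell delay after the CLK pulse arrives.
    return TTResult(value=a | b, clk_ps=clk_ps, out_ps=clk_ps + delay_ps)


r = tt_or(a=1, b=0, t_in_ps=0.0)
print(r.value, r.clk_ps, r.out_ps)  # prints: 1 2.0 4.0
```

Because the CLK time is derived from the same input event as the data, no separately tuned delay element appears in the sketch; the ordering check holds by construction under the equal-delay assumption.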
Compared with other handshaking schemes, this approach can substantially reduce the interconnect wire delay and the circuit complexity relative to the synchronous circuit. The timing signals are distributed without any extra delay. The number of junctions needed in the TT scheme is somewhat larger than in conventional single-rail logic circuits, but the scheme allows an efficient and robust asynchronous system to be built without timing problems or delay modules. As the number of bits to be processed increases, this junction overhead becomes negligible.
TT Data Processing Pipeline
An excellent data processing pipeline is presented in [8]. It consists of 7 elastic pipeline stages, with DR2 cells used for data storage. When a stage is in passive mode, the delay of the delay module is determined by the DR2 read-out time. In active mode, the delay is larger by the time a logic cell needs to produce its result.
To ensure proper operation of that elastic pipeline, the computation delay of each logic cell must be estimated precisely. It is very difficult to obtain precise delay estimates for several different logic cells at the layout level, and a small estimation error may lead to extra delay or incorrect operation. In this paper, we present a delay-insensitive data processing pipeline based on our TT scheme. This pipeline needs no delay modules and still guarantees proper operation. Fig. 2 shows the cell-level data processing pipeline. It is an elastic pipeline similar in structure to that of [8], but the significant difference is that our pipeline contains no delay modules. In our pipeline, the active or passive mode of a stage is selected by the DR2 cell, and "Ati" and "Bti" are always driven with logic "1". The pulses on "Ati" and "Bti" are stored in the DR2 cell. When the active-mode select signal arrives, the two pulses are fed into the "andt" cell. When the computation of "andt" is completed, a result pulse appears at its output. This output pulse serves as the clock signal of the stage, producing the correct result from the timed cell, and also as the REQ signal of the next stage. The next section presents the simulation results for the pipeline.
Logic Simulation
Several software tools, such as JSIM and WRspice, simulate RSFQ circuits at the junction level. Such tools are convenient for verifying the function and timing of RSFQ logic gates. However, as the scale of RSFQ digital circuits increases, the computation time becomes intolerable.
Large-scale RSFQ circuits therefore need to be simulated at the logic level. Hardware description languages, such as Verilog HDL and VHDL, have been adopted to simulate RSFQ circuits at the logic level [8] [9]. With these languages, the basic RSFQ gates, including their timing parameters (such as hold time, setup time, and minimum separation time), can be written as functional models. A large RSFQ circuit consisting of thousands of gates can then be simulated using standard semiconductor CAD tools, such as Cadence and Verilog HDL simulators.
Therefore, a precise HDL model of each RSFQ gate is the basis for correct RSFQ logic simulation and correct time-domain simulation. In this paper, we use Verilog HDL to describe the RSFQ gates. For logic simulation, we are concerned with the presence or absence of each SFQ pulse, the position of its center, and the transfer delay of the logic gate; the pulse length and exact shape do not matter. To simplify the description, we set both the pulse width and the delay to 2 ps. In practice, these values can be changed according to the actual cell parameters.
Following the above discussion, we first write all the cells in Verilog HDL and specify the delay of every module, and then connect them according to the actual circuit connections. The simulated result of the data processing pipeline is shown in Fig. 4, in which the "and" operation of the pipeline is selected. The input data are 8'b11010111 and 8'b10000111, and the result is 8'b10000111; a short time later, an ACKO signal is output from the module.
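The reported "and" result can be cross-checked with a short functional model. The sketch below is written in Python rather than Verilog HDL and is not the authors' model: the function name `and_stage` is hypothetical, each operand is given as a bit string, and the stage is assumed to release its result one gate delay after its clock pulse, reusing the 2 ps delay stated above.

```python
from typing import Tuple

GATE_DELAY_PS = 2  # gate delay taken from the text; applying it to a whole stage is an assumption


def and_stage(a_bits: str, b_bits: str, clk_ps: int) -> Tuple[str, int]:
    """Bitwise AND of two equal-width bit strings.

    Returns the result bit string and the time at which it is released,
    modeled here as the clock time plus one gate delay.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    result = "".join(str(int(x) & int(y)) for x, y in zip(a_bits, b_bits))
    return result, clk_ps + GATE_DELAY_PS


result, out_ps = and_stage("11010111", "10000111", clk_ps=0)
print(f"8'b{result} at {out_ps} ps")  # prints: 8'b10000111 at 2 ps
```

The computed value agrees with the 8'b10000111 result read from the simulation; the output time is only a property of this sketch and says nothing about the ACKO timing in the figure.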
Conclusions
This paper presents a novel asynchronous RSFQ digital circuit model, the TT model. As an example, we construct a data processing pipeline based on the TT model and present its logic simulation results. In the TT model structure, the arrival order of the data pulse and the clock pulse is ensured by the TT circuit itself, and the problems caused by global timing are avoided. The simulation results show that the RSFQ TT model has significant advantages for the layout design of RSFQ supercomputers.
