Abstract: A self-timed ring using NULL convention logic (NCL) is presented. An analytical method to evaluate the speed of NCL rings has been developed. The analytical predictions are verified by a Synopsys simulation, and excellent agreement between the theoretical predictions and the simulation results is obtained. Some important principles for ring optimisation are derived. The analysis leads to the speed optimisation of a 24-bit NCL divider.
Introduction
In recent years, the implementation of floating-point division has received increasing attention. The computation time of a division operation is fairly long compared with other floating-point operations such as multiplication and addition. Moreover, applications such as three-dimensional graphics use division frequently. It has been shown that ignoring its implementation can result in significant system performance degradation for many applications [1].
A self-timed ring is an efficient approach to the implementation of division based on the SRT algorithm [2] [3] [4] [5], which is the most commonly used algorithm in modern processors [1]. In this ring structure, the computation can progress as fast as the circuit configuration and fabrication technology allow [6], rather than being slowed down by the worst-case delay or clock skew margins found in synchronous counterparts. Furthermore, because the iterative operations in division are identical, a self-timed ring with a few stages can be used to perform a floating-point mantissa division operation without speed loss compared with the fully expanded pipeline version. One of the early designs was proposed by Williams [3]. Since then, several modifications have been suggested in order to improve performance [4, 5]. However, all of these methods use differential cascode voltage switch logic (DCVSL) [6].
Some conventional CMOS design principles are not applicable to DCVSL. For example, the rise and fall times of output nodes in DCVSL circuits are generally different, due to an inherent asymmetry between the NMOS tree and the PMOS load. Careful sizing of transistors is required for DCVSL to avoid possible problems associated with races [8]. For DCVSL, a timing assumption appears during the precharge phase: it is assumed that, when the request signal goes low, the input data are all reset. Yet this constraint is not always checked for validity in practice [9].
The method for self-timed ring design presented in this paper is based on NULL convention logic (NCL), which is a new clockless methodology for digital system design [10, 11]. It offers a quasi-delay-insensitive logic paradigm, where control is inherent with each datum, and follows the so-called 'weak conditions' of Seitz's delay-insensitive signalling scheme [12]. The NCL paradigm assumes that forks in wires are isochronic [13]. Besides the common advantages of general asynchronous circuits, such as speed independence, average-case delay, low power, low noise and low electromagnetic interference (EMI), a formal and systematic method of design and optimisation has been developed to make the design of NCL circuits straightforward [14, 15]. A complete register-transfer level (RTL) design flow for NCL can be performed using commercial CAD tools [16], such as Synopsys and the Modelsim VHDL simulator.
However, the speed of an NCL ring circuit depends not only on the latencies of individual stages in the data path, but also on the speeds of the handshake circuits and the number of stages. To investigate how these factors affect the ring speed, an analytical method is developed using dependency graphs. Performance analysis of self-timed rings yields important implications for designing NCL rings with optimal speed performance or with an appropriate speed-area tradeoff. These implications are applied to the design of a 24-bit floating-point mantissa division.
2
NCL uses symbolic completeness of expression to achieve self-timed behaviour [10]. A symbolically complete expression is defined as an expression that depends only on the relationships of the symbols present in the expression, without reference to the time of evaluation. In general, a multi-rail signal can be used to incorporate data and control information into one mixed signal path to eliminate the time reference, and therefore to form a symbolically complete expression. Typically, a dual-rail signal D consists of two wires, D0 and D1, which represent a value from the set {Data0, Data1, Null}, as shown in Table 1.
The hysteresis behaviour of an M-of-N threshold gate requires that the output only changes after a sufficiently complete set of input values has been established. In the case of a transition to DATA, the output remains at NULL until at least M of the N inputs become DATA. In the case of a transition to NULL, the output remains at DATA until all N inputs become NULL.
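The dual-rail encoding and the hysteresis rule described above can be illustrated with a small behavioural sketch. This is not a transistor-level or VHDL model; the names `ThresholdGate`, `decode`, `NULL`, `DATA0` and `DATA1` are hypothetical helpers introduced only for illustration, and the mapping of Data0 to (D0, D1) = (1, 0) and Data1 to (0, 1) is the usual NCL convention, assumed here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Dual-rail code words written as (D0, D1); (1, 1) is not a legal code word.
NULL: Tuple[int, int] = (0, 0)
DATA0: Tuple[int, int] = (1, 0)
DATA1: Tuple[int, int] = (0, 1)


def decode(signal: Sequence[int]) -> Optional[int]:
    """Return 0 or 1 for a DATA code word and None for NULL."""
    word = tuple(signal)
    if word == NULL:
        return None
    if word == DATA0:
        return 0
    if word == DATA1:
        return 1
    raise ValueError("(1, 1) is forbidden in dual-rail NCL")


@dataclass
class ThresholdGate:
    """Behavioural M-of-N threshold gate with hysteresis (illustrative only)."""
    m: int         # threshold M
    n: int         # number of input wires N
    out: int = 0   # current output wire: 0 is NULL, 1 is DATA

    def update(self, inputs: Sequence[int]) -> int:
        if len(inputs) != self.n:
            raise ValueError(f"expected {self.n} inputs, got {len(inputs)}")
        asserted = sum(1 for v in inputs if v)
        if asserted >= self.m:
            self.out = 1   # at least M inputs are DATA: output goes to DATA
        elif asserted == 0:
            self.out = 0   # all N inputs are NULL: output returns to NULL
        # in every other case the previous output is held (hysteresis)
        return self.out


# A 2-of-3 gate: one input is not enough to set it, and once set it holds
# its output until every input has returned to NULL.
th23 = ThresholdGate(m=2, n=3)
trace = [th23.update(v) for v in [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 0)]]
print(trace)  # [0, 1, 1, 0]
```

With `m == n` the same class behaves as a Muller C-element, and with `m == 1` it behaves as an OR gate, since the hold case can then never occur.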
As special examples, an N-of-N gate is an N-input Muller C-element, while a 1-of-N gate corresponds to an N-input OR gate. Transistor-level design for M-of-N threshold gates is described in [17]. As an example, a 2-of-3 gate having inputs A, B and C is shown in Fig. 1. Threshold gates can be used to build combinational circuits. An NCL half-adder is shown in Fig. 2, where (x0, x1) and (y0, y1) denote the dual-rail encoded input addends, and (c0, c1) and (s0, s1) denote the dual-rail encoded carry and sum outputs, respectively.
Fig. 1 Threshold gate with hysteresis
Fig. 2 NCL half-adder
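To make the idea of building combinational logic from threshold gates concrete, the following is a minimal behavioural sketch of a dual-rail half-adder. It does not reproduce the gate netlist of Fig. 2: it assumes a simple minterm construction, with one 2-of-2 gate per input combination and 1-of-N gates to merge the minterms onto each output rail. The classes `Th` and `HalfAdder` are hypothetical names used only for this illustration.

```python
class Th:
    """Minimal M-of-N threshold gate with hysteresis (illustrative helper)."""

    def __init__(self, m):
        self.m = m
        self.out = 0

    def __call__(self, *wires):
        high = sum(wires)
        if high >= self.m:
            self.out = 1   # threshold reached: set
        elif high == 0:
            self.out = 0   # all inputs NULL: reset
        return self.out    # otherwise hold the previous value


class HalfAdder:
    """Dual-rail half-adder built from minterms (an assumed structure)."""

    def __init__(self):
        # One 2-of-2 gate (C-element) per combination of input values (a, b).
        self.minterm = {(a, b): Th(2) for a in (0, 1) for b in (0, 1)}
        # 1-of-N gates (OR) merge the minterms belonging to each output rail.
        self.s0, self.s1, self.c0 = Th(1), Th(1), Th(1)

    def step(self, x, y):
        """x and y are (rail0, rail1) pairs; returns ((c0, c1), (s0, s1))."""
        m = {(a, b): gate(x[a], y[b]) for (a, b), gate in self.minterm.items()}
        s = (self.s0(m[0, 0], m[1, 1]), self.s1(m[0, 1], m[1, 0]))
        c = (self.c0(m[0, 0], m[0, 1], m[1, 0]), m[1, 1])
        return c, s


NULL, DATA0, DATA1 = (0, 0), (1, 0), (0, 1)

ha = HalfAdder()
results = {}
for x in (DATA0, DATA1):
    for y in (DATA0, DATA1):
        results[(x, y)] = ha.step(x, y)   # DATA wavefront
        ha.step(NULL, NULL)               # NULL wavefront resets every gate

for (x, y), (c, s) in results.items():
    print(x, y, "-> carry", c, "sum", s)
```

Because every minterm gate has hysteresis, the outputs become DATA only after both inputs are DATA and return to NULL only after both inputs are NULL, which is the input-completeness behaviour the ring relies on.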
A typical NCL circuit structure forms a pipeline, in which each stage consists of a register, a completion detection circuit and a combinational circuit [10]. An NCL ring is formed when the output of a pipeline is connected to its input, as shown in Fig. 3. For simplicity, a half-adder is used as the combinational circuit for each stage in the ring. Note that, although one of the three bits in the data bus is not involved in the half-adder computation, it puts a load on the completion detection block. All of the registers are initialised to Null outputs, except that the outputs of register $R_N$ are initialised to Data0 or Data1. Fig. 4 illustrates the gate-level structures of the register and completion detection circuit used in the ring. Th22nx0 (Th22dx0) is a 2-of-2 threshold gate that is initialised to NULL (DATA). Th12bx0 is a 1-of-2 threshold gate followed by an inverter. Th22x0 is a 2-of-2 threshold gate without initialisation control.
Analysis of rings using dependency graphs
An efficient analysis method for ring speed performance is based on the use of a dependency graph, which was proposed by Williams [2, 18]. Williams' theory of rings shows how to analyse ring performance in terms of speed and area. This theory will be used below to develop an analytical model for evaluating self-timed rings implemented in NCL.
First, several parameters need to be defined. Local parameters include the forward latency ($L_f$), the reverse latency ($L_r$), and the local cycle time ($P$). The forward latency specifies the delay from valid data outputs at one stage to valid data outputs at the following stage, without waiting for the request signal. The reverse latency specifies the delay from the request of a stage output to the request of its predecessor's output, without waiting for Data or Null in the data path. The local cycle time specifies the total delay of all the transitions necessary for a stage to pass a token and become enabled again for the next token. This limits the maximum throughput of a ring. A key global parameter is the total latency, that is, the delay between the introduction of a new data token into the ring and the removal of the corresponding processed token after the token has passed through the stages necessary to solve an iterative problem. Since the number of iterations for a given problem is constant, and each iteration corresponds to one stage, the total latency is proportional to the average time for a token to pass through one stage. The average propagation delay of a single stage will therefore be used as the metric of ring speed.
Folded dependency graph
A dependency graph of a pipeline can be constructed from its structure. Fig. 5a shows the dependency graph of the NCL pipeline. Dependency graphs can be used to determine both the stage latencies (forward and reverse) and the local cycle time [2]. When the stages are identical, the folded dependency graph can be used to calculate the local parameters more conveniently, as shown in Fig. 5b, where $t_i$ is the ith node transition delay in the chosen cyclic path and $w_i$ is the corresponding stage index offset.
Fig. 5 Dependency graph for NCL pipeline
Local analysis
As a particular example, the latencies and the local cycle time of an NCL pipeline can be analysed using the folded dependency graph. The forward latency is given by
The reverse latency is given by
To determine the local cycle time, the longest cycle with zero offset needs to be found. There are three possibilities for the longest zero-offset cycle in Fig. 5b, as follows:
The lengths of the three paths are
Since the sum of the first four terms of $P_2$ is the midpoint of the sum of the first two terms of $P_1$ and the sum of the first two terms of $P_3$, the local cycle time is:
From (1), (2) and (4), the local cycle time is obtained in terms of the forward and reverse latencies:
Ring performance analysis
The local cycle time of an NCL ring satisfies $P = 2(L_f + L_r)$, which means that the token flow rate is limited either by the forward latency (data-limited) or by the reverse latency (bubble-limited) [18]. We consider the case in which there is only one token in an N-stage ring, so that there are (N-2) bubbles in the ring. When N is large, there are so many bubbles in the ring that the token circulation period falls in the data-limited region, and
When N is small, there are so few bubbles that the token circulation period falls in the bubble-limited region [18], and
In the data-limited region, the delay of the completion detection is removed from the critical path. Given the two local parameters, the optimal number of stages is defined as the minimum N in the data-limited region. It is achieved when (6) and (7) are equal, which gives
The final expression of the token circulation period is given by
Therefore, the average time for a token to pass through one stage is
We see from (8) that, when $L_r$ is decreased by speeding up the register and/or the completion detection circuit, the number of stages required for optimal speed is reduced. On the other hand, speeding up the forward data path increases the number of stages required to reach the higher optimal speed. It is also implied by (10) that a ring with too few stages falls in the bubble-limited region, where the ring speed can be improved by reducing the reverse latency and/or increasing the number of stages until the optimal speed is achieved. A ring with more stages falls in the data-limited region, where the ring speed is the optimal speed, which depends only on the forward latency $L_f$.
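The trade-off between the two regions can be explored numerically. The sketch below is an assumed Williams-style model, not a restatement of (6)-(10): it takes the data-limited period as $N L_f$ and the bubble-limited period as $2 N L_r / (N - 2)$, forms chosen so that the two regions meet at $N = P / L_f$ with $P = 2(L_f + L_r)$ and $(N-2)$ bubbles. The function names and the latency values (arbitrary time units) are hypothetical.

```python
import math


def local_cycle_time(l_f, l_r):
    """Local cycle time P = 2 (L_f + L_r)."""
    return 2.0 * (l_f + l_r)


def circulation_period(n_stages, l_f, l_r):
    """Assumed token circulation period for one token in an N-stage ring."""
    if n_stages < 3:
        raise ValueError("the model needs at least one bubble, i.e. N >= 3")
    data_limited = n_stages * l_f                          # token never waits
    bubble_limited = 2.0 * n_stages * l_r / (n_stages - 2)  # token waits for bubbles
    return max(data_limited, bubble_limited)


def stage_time(n_stages, l_f, l_r):
    """Average time for the token to pass through one stage."""
    return circulation_period(n_stages, l_f, l_r) / n_stages


def optimal_stages(l_f, l_r):
    """Smallest integer N in the data-limited region (N >= P / L_f)."""
    return math.ceil(local_cycle_time(l_f, l_r) / l_f)


# Illustrative latencies in arbitrary units (not taken from Table 2).
L_F, L_R = 2.0, 3.0
n_opt = optimal_stages(L_F, L_R)
table = {n: stage_time(n, L_F, L_R) for n in range(3, 9)}

print("optimal N:", n_opt)
for n, t in table.items():
    region = "data-limited" if n >= n_opt else "bubble-limited"
    print(f"N={n}: stage time {t:.2f} ({region})")
```

Under these assumptions the per-stage time falls as stages are added until N reaches the optimum, after which it stays at $L_f$, which matches the qualitative behaviour described above.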
Ring initialisation
Token introduction, as well as the characteristics of the threshold gates, requires a correct initialisation for NCL rings. Since the threshold gates used in combinational circuits have no reset input for initialisation, the validity of the initial output for each gate is guaranteed only by the completeness of the input signals. However, the dual-rail code (1, 1) is forbidden in NCL. Therefore, it is convenient to initialise all inputs of a combinational circuit to Null, so that the initial outputs of the circuit are validly all Null. To realise this idea for NCL ring initialisation, an additional register with initialised Null output is inserted between the register with initialised Data output and the following combinational circuit, as shown in Fig. 6.

Fig. 6 Ring with initialising stage (R, D)
For a ring initialised in this way, it is not necessarily the case that each stage performs identically. Based on the analysis in subsection 3.3, the additional register contributes a delay to the token circulation period in the data-limited region, while an additional bubble modifies the token circulation period in the bubble-limited region. The token circulation period for a ring with an initialising stage is given by
where R is the maximum of $R_r$ and $R_f$. The optimal number of stages for a ring with an initialising stage is similarly defined as the value corresponding to the intersection of the bubble-limited region and the data-limited region, which is given by
Note that $N'_{optimal}$ is slightly smaller than $N_{optimal}$ in (8). The corresponding average time for a token to pass through one stage is given by
Noting that R is usually small, (13) shows that the introduction of the initialising register improves the ring speed in the bubble-limited region, while leading to a negligible degradation of the ring speed in the data-limited region.
Simulation results and discussion
To verify the effectiveness of the above analysis, the ring circuits are simulated using VHDL code in Synopsys. The delays of the threshold gates specified in the VHDL library are listed in Table 2. The three local parameters $L_f$, $L_r$ and $P$, calculated from the gate delays, are also given in Table 2. First, the ring performance can be evaluated using the equations in Section 3 without running the VHDL simulation. The token circulation periods as a function of the number of stages are plotted in Fig. 7. The average times of one stage as a function of the number of stages are plotted in Fig. 8. In these Figures, the dashed lines correspond to (9) and (10), while the solid lines correspond to (11) and (13).
Second, by writing VHDL code for the ring circuits and running the simulation, the token circulation period can be measured from the output waveform, and thus the ring performance can be obtained. The corresponding results are plotted in Figs. 7 and 8. The analytical and simulation curves agree with each other very well. Although a discrepancy occurs due to data-dependent delay and the assumption that the half-adder has the same delay for rising and falling transitions, it is negligible.
The results in the previous Section can be used as a guideline for the optimisation of an SRT division circuit. A radix-2 SRT division algorithm can be implemented by designing the combinational circuits in the ring structure. Fig. 9 shows the combinational circuit for one stage of a 24-bit division. When the data wavefront arrives, the combinational circuit produces one bit of the quotient, and the partial remainder that will be used in the next stage. To speed up the calculation of partial remainder, the partial remainder is
From the analysis, as well as the simulation, some important principles for optimal ring design are suggested. When the R, F and D delays are known, the optimal number of stages can be determined by (8), which means that a further increase in the number of stages does not improve speed, and that a ring with fewer stages will be slower. The optimised speed depends only on the forward latency $L_f$, independent of the delay of the completion detection block.
For an optimised NCL ring, further speed improvements of the components R, F and D provide different opportunities to improve the overall ring performance. Improving the register speed directly increases the overall ring speed, while the number of stages hardly needs to change. Speeding up the combinational circuit does not improve the overall ring speed unless the number of stages is increased. Speeding up the completion detector does not improve the ring speed, but it allows the number of stages to be reduced without speed loss.
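These three design principles can be checked with a small sensitivity sketch. It rests on assumptions that are not stated as equations in the text: the forward latency is taken as $L_f = R + F$, the reverse latency as $L_r = R + D$, and the average stage time as $\max(L_f,\ 2 L_r / (N - 2))$ for a ring with one token. The function names and the delay values (arbitrary units) are hypothetical.

```python
import math


def latencies(r, f, d):
    """Assumed forward and reverse latencies: L_f = R + F, L_r = R + D."""
    return r + f, r + d


def stage_time(n_stages, r, f, d):
    """Assumed average time per stage for one token in an N-stage ring."""
    l_f, l_r = latencies(r, f, d)
    return max(l_f, 2.0 * l_r / (n_stages - 2))


def optimal_stages(r, f, d):
    """Smallest integer N for which the ring is data-limited."""
    l_f, l_r = latencies(r, f, d)
    return math.ceil(2.0 * (l_f + l_r) / l_f)


# Baseline delays in arbitrary units (illustrative, not from Table 2).
R, F, D = 1.0, 2.0, 5.0
n0 = optimal_stages(R, F, D)          # optimised baseline ring
base = stage_time(n0, R, F, D)

# 1. Faster register: the ring speeds up with the stage count unchanged.
faster_register = stage_time(n0, 0.5, F, D)

# 2. Faster combinational circuit: no gain at the old stage count,
#    a gain only once more stages are added.
faster_logic_same_n = stage_time(n0, R, 1.0, D)
faster_logic_more_n = stage_time(optimal_stages(R, 1.0, D), R, 1.0, D)

# 3. Faster completion detector: no gain in speed, but fewer stages suffice.
n_fewer = optimal_stages(R, F, 2.0)
faster_detector_fewer_n = stage_time(n_fewer, R, F, 2.0)

print(n0, base, faster_register)
print(faster_logic_same_n, faster_logic_more_n)
print(n_fewer, faster_detector_fewer_n)
```

With these illustrative numbers, the baseline ring has six stages and a stage time of 3.0. Halving the register delay lowers the stage time at six stages, halving the logic delay helps only after moving to eight stages, and a faster completion detector keeps the stage time at 3.0 with four stages.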
