This paper describes an efficient way to synchronise and pipeline asynchronous circuits built out of precharged function blocks. Latched DCVS Logic, a technique related to latches known from the literature, is used in order to improve the storage capability of these precharged function blocks. Improving the storage capability of the building blocks allows the design of an efficient pipeline scheme that is described in detail. After its potential performance has been described, the pipeline scheme is applied to the design of self-timed rings. It is shown that more compact ring structures can be obtained without loss of performance. Our design methodology is then presented. It is based on the use of a private asynchronous standard cell library which is fully compatible with an existing CMOS standard cell library provided by the foundry. Our approach allows standard cell based asynchronous circuits to be designed very quickly. Finally, both the pipeline scheme and the design approach are illustrated by the design of a self-timed ring divider. The division algorithm is first presented, together with an extension that provides square root extraction. The chip architecture is then described with the results obtained after fabrication. The test chip has been fabricated using the SGS-Thomson/CNET 0.5 µm three metal layer technology. The 0.7 mm² chip computes 32-bit divisions in 101 ns, with a power consumption of 30 mW at a throughput of ten million operations per second.
involved in the pipeline schemes. Asynchronous pipelined circuits based on the use of registers are very similar to synchronous pipelines, except that they are locally controlled. Many types of asynchronous circuits (self-timed, speed-independent, quasi-delay-insensitive, delay-insensitive) can be designed using this technique, depending on the delay insensitivity properties of the elements involved in the pipeline (register, control and logic parts).
Figure II.2 : a pipeline stage structure when implementing the storage elements with registers. When synchronous registers are used, the acknowledge signal is directly issued from the request signal, introducing a timing assumption (the output of the register has to be valid before the logic starts computing; known as the bundled data constraint).
Pipeline schemes based on latches.
In this type of asynchronous pipeline, a latch is used to separate each stage. It does not allow full computation concurrency between stages: only one stage out of two can compute at a time. This kind of pipeline is close to the single phase clocking technique used in synchronous circuits [15] [16]. But here again, stages are controlled using local handshake signals.
Such pipeline schemes may be constructed using standard latches as described in figure II.3, or using Muller C elements or DCVSL latches as storage elements. All kinds of asynchronous circuits may be designed using this pipeline scheme involving latches.
Micropipelines use standard latches associated with a two phase protocol [17], Williams' circuits use Muller C elements associated with a four phase protocol [18], and Tan's technique uses self-timed precharge latches associated with a four phase communication protocol [19].
Pipeline schemes based on non explicit latches.
This last type of pipeline combines the logic and the storage element. It relies on the use of precharged function blocks, which allow the data to be temporarily stored at the output of the function blocks [18]. Hence, storage elements are not explicit in these structures. Moreover, a single stage performs a computation and part of the memory function required to pipeline a circuit. Several stages are then required to implement the equivalent of a full storage function. The main advantage of this pipeline scheme is that it does not require any register or latch, leading to lower computation time and area.
Our work is based on this kind of pipeline, which is described in the next section. The choice of the Differential Cascode Voltage Switch Logic to implement the precharged function blocks mentioned above, which are required to design such a pipeline scheme, is also briefly argued.
B. Pipeline schemes based on DCVS Logic
The DCVS Logic is an attractive way to implement asynchronous operative functions [20] [21] [35] and has been widely used in many asynchronous designs [22] [23] [24]. Figure II.4 recalls the structure and principle of a DCVSL cell [1]. A DCVSL cell is built of two parts, a Load and a Functional Tree. The Load is controlled by a request signal whose level asks for a computation (high) or a precharge (low). When precharging, the cell produces an invalid data, or spacer, on the dual rail coded output. In fully controlled precharged logic the request signal also disconnects the tree from the ground level to prevent the firing that could occur if valid inputs were present while precharging. When a computation is enabled, one and only one of the internal nodes is discharged through the functional tree as soon as the inputs are valid. The output of the cell is directly the dual rail code of the data computed by the functional tree. To provide a completion signal, the internal nodes are simply nanded or nored. The completion signal can be generated by "nanding" the internal nodes or by "oring" the output and the complemented output.
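The behaviour described above can be summarised by a small behavioural model. The following Python sketch is only an illustration under simplifying assumptions: the function name `dcvsl_cell` and its arguments are hypothetical, delays are ignored, and the functional tree is reduced to a single Boolean result `f_value`.

```python
from typing import Tuple


def dcvsl_cell(req: bool, inputs_valid: bool, f_value: bool) -> Tuple[bool, bool, bool]:
    """Behavioural sketch of one DCVSL cell: returns (out, out_bar, done).

    req          -- request level: True asks for a computation, False for a precharge
    inputs_valid -- True when the dual rail inputs carry valid data (not a spacer)
    f_value      -- Boolean value the functional tree would compute (assumption:
                    the tree is abstracted to its result)
    """
    if req and inputs_valid:
        # Evaluation: one and only one internal node is discharged through the tree.
        node, node_bar = (not f_value), f_value
    else:
        # Precharge, or request high while the inputs are still a spacer:
        # both internal nodes stay at the precharged (high) level.
        node, node_bar = True, True

    # The outputs are the inverted internal nodes (dual rail code of the result).
    out, out_bar = (not node), (not node_bar)

    # Completion signal: NAND of the internal nodes, which equals out OR out_bar.
    done = not (node and node_bar)
    return out, out_bar, done
```

With this model, `dcvsl_cell(False, True, True)` yields the spacer `(False, False)` with a low completion signal, whereas a high request with valid inputs yields exactly one high rail and a high completion signal.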
Despite its higher power consumption than standard CMOS circuits [25] , the DCVS Logic has important advantages for asynchronous circuits. It is first very fast, since it is based on a dynamic domino type logic. The complexity of a DCVSL cell is a little higher than standard CMOS cells, resulting in a slight increase of the number of transistors and area, when compared to CMOS circuits.
There are two main motivations for using DCVS logic cells to design asynchronous circuits: first, their ability to elegantly and efficiently implement a four phase communication protocol with dual rail coded data, and secondly, their ability to provide a very reliable completion signal. To illustrate the behaviour of a pipeline circuit built from single stages, three stages need to be associated (Figure II.5). If we assume that initially all the signals are low, the pipeline works as follows. When a request is applied to stage 1 and the input data are valid, it computes and generates an acknowledge signal. This acknowledge signal starts the computation of stage 2, which generates its acknowledge signal. The stage 2 acknowledge signal allows the next stage to compute and also allows stage 1 to precharge. Hence, at this point, stage 1 is able to precharge, stage 2 is holding its output, and stage 3 is able to compute. Until now the computation wavefront has been propagated from stage 1 to stage 3.
It can now be followed by the propagation of the precharge wavefront. It must be noted that at some point, three stages are in three different states. One is precharging, the following one is holding its output data and the next one is computing.
Hence, in an asynchronous pipeline, the control activity is localised in one stage when registers are used, in two stages when latches are used, and in three stages when non explicit latches and precharged function blocks are used. This is, of course, due to the memorisation capability of the storage elements used.
Starting from this idea, in the next section we investigate in depth how to exploit the memorisation capability of the load in the DCVS Logic and furthermore, how to make it more efficient in order to improve the pipeline scheme. To do so, we apply a slight modification of the loads in order to increase the storage potential of the cells. We then analyse the impact of this type of logic, called Latched DCVS Logic, on the pipeline scheme behaviour and performance.
III. A new pipeline scheme based on Latched DCVS Logic

A. Motivations
Coming back to the behaviour of a DCVSL cell, we investigate in detail the timing relationship between the input data and the outputs of the cell under control of the request signal. Let us first consider the computation phase. When the request signal is high, the computation is enabled, and takes place only when valid data are applied to the tree. One of the nodes is discharged and the corresponding output goes from low to high. This behaviour is potentially delay-insensitive since the validity of all the incoming data is checked before an output is provided.
On the other hand, with the structure of figure II.5, driving request low for precharging immediately causes the outputs to go low without checking the input data state.
It means that the succeeding cell may receive a spacer (or invalid data) which is not propagated but generated by the previous cell. This is at the origin of a timing assumption that has to be respected somewhere in the circuit. Consequently, the pipeline structure of figure II.5 does not implement a delay-insensitive or quasi delay-insensitive circuit. The timing assumption appears during the precharge phase; before request goes low we assume that the data have been reset. This is not checked at all. Adding the logic to make the pipeline scheme quasi delay-insensitive is possible but is expensive.
What we want to do is to take advantage of the storage capability of the load cell while keeping realistic timing assumptions, such as the one described above. A delay-insensitive behaviour during the computation phase is not really necessary when the precharge phase is not delay-insensitive, since the circuit as a whole cannot be delay-insensitive anyway. So there is no reason not to modify the cell structure and control to improve performance, even if the same type of timing assumption is introduced during the computation phase. Basically, we thought that the behaviour of this kind of circuit is in some sense too luxurious with respect to timing assumptions during the evaluation phase. This is the basic idea which underlies the design of the new pipeline scheme proposed. The following sub-sections first present the Latched DCVS Logic building blocks and then describe the new pipeline scheme.
B. Latched DCVS Logic building blocks
The goal of the Latched DCVS Logic is hence to improve the storage capability of the precharged function block. From the behaviour analysis of the pipeline, it can be seen that while one stage is computing, another stage is precharging and a third one is holding its output data. The Latched DCVS Logic is introduced so that a single stage can both precharge and hold the output data. So we thought of modifying the precharged function block to maintain the output data while it is precharging. This can easily be done by adding two NMOS transistors T1 and T'1 to the output inverters of conventional DCVSL (see Figure III.1), similar to the latches used in True Single Phase Clocking schemes using dynamic logic [15] [16] and later applied to differential logic (called LCDL for Latched CMOS Differential Logic) by Wu and Cheng [2]. However, LCDL uses a cross-coupled latch that we avoided because it is slower than the pseudo-static load described in Figure III.
From a functional point of view, such Latched DCVSL cells are able to memorize a full token: a spacer in the internal nodes and a data dynamically in the output latch. It must be noted that all the pipeline schemes proposed, and based on precharged function blocks, use the classical four phase handshake protocol.
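To make the "full token" idea concrete, the following Python sketch models a Latched DCVSL cell at a purely behavioural level. It is an illustration only: the class name `LatchedDCVSLCell` and the method `step` are hypothetical, the functional tree is abstracted to its Boolean result, the inputs are assumed valid whenever the request is high, and the dynamic output latch is assumed to hold its value indefinitely.

```python
from typing import Tuple


class LatchedDCVSLCell:
    """Behavioural sketch: internal nodes hold a spacer while outputs hold data."""

    def __init__(self) -> None:
        # Internal nodes start precharged (high); outputs start low.
        self.node = True
        self.node_bar = True
        self.out = False
        self.out_bar = False

    def step(self, req: bool, f_value: bool) -> Tuple[bool, bool, bool]:
        """Apply one request level and return (out, out_bar, done)."""
        if req:
            # Evaluation: one internal node is discharged and, with T1 and T'1
            # conducting, the output inverters follow the internal nodes.
            self.node, self.node_bar = (not f_value), f_value
            self.out, self.out_bar = (not self.node), (not self.node_bar)
        else:
            # Precharge: internal nodes return to the spacer, but T1 and T'1
            # isolate the outputs, which keep the last computed data.
            self.node, self.node_bar = True, True
        done = not (self.node and self.node_bar)  # completion from internal nodes
        return self.out, self.out_bar, done
```

After an evaluation followed by a precharge, the completion signal is low (the internal nodes carry a spacer) while the outputs still carry the previous data, which is the situation the text calls a full token stored in one cell.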
C. The new pipeline scheme
As we now have precharged building blocks able to memorize data during the precharge phase, let us see how to take advantage of this feature while associating such cells in a pipelined fashion.
So now, when a function block has evaluated and produced its outputs, it can immediately be precharged, and the next stage is potentially able to compute.
If we look in detail at both the evaluation and the precharge phases, their behaviour is as follows.
Evaluation phase. The conditions for a stage to be able to evaluate are : a) the input data must be valid, b) the next stage must be in the precharge phase, i.e. the previous evaluation must be completed (this ensures that all computed data coming out of a stage are processed by the next stage).
Condition a) is checked using the completion signal. But now, as there is no spacer between valid data, a timing assumption appears which must guarantee that the function block is starting the computation after the new data are valid at its inputs (data bundling constraint).
Condition b) is easily checked using the control input of next stage.
We obtain the following relation to design the evaluation control part of one stage :
Ack_{i-1} and not (Req_{i+1}) -> Req_i up.
Precharge phase. The conditions for a stage to be precharged are now the following. a) the input data must be invalid. As before this cannot be checked, except that now it cannot be checked at all because there is no longer any spacer. However, what can be done is to sense that the Ack_{i-1} signal is low (corresponding to "previous stage is precharged"), independently of the input data state. b) the next stage must be in the evaluation phase, i.e. the previous precharge phase must be completed (Req_{i+1} must be high).
The resulting precharge condition is therefore expressed as : not (Ack_{i-1}) and Req_{i+1} -> Req_i down.
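The two conditions can be sketched as a state-holding element. The Python fragment below is a hedged illustration, not the authors' controller: it assumes that the evaluation condition requires the request of stage i+1 to be low (next stage in precharge, as condition b) of the evaluation phase states), that the precharge condition requires it to be high, and that the request keeps its value otherwise, which is the behaviour of a Muller C element with one inverted input. The names `muller_c` and `next_req` are hypothetical.

```python
def muller_c(a: bool, b: bool, previous: bool) -> bool:
    """Muller C element: output follows the inputs when they agree, else holds."""
    if a and b:
        return True
    if not a and not b:
        return False
    return previous


def next_req(ack_prev: bool, req_next: bool, req_now: bool) -> bool:
    """Request of stage i from Ack of stage i-1 and Req of stage i+1.

    Rises when Ack(i-1) is high and Req(i+1) is low (evaluation condition),
    falls when Ack(i-1) is low and Req(i+1) is high (precharge condition),
    and holds its current value in the two remaining cases.
    """
    return muller_c(ack_prev, not req_next, req_now)
```

For instance, `next_req(True, False, False)` returns `True` (the stage starts evaluating) and `next_req(False, True, True)` returns `False` (the stage starts precharging).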
During the precharge phase, apart from the timing assumption described above, a new timing assumption exists. When stage "i" is being precharged, we assume that the outputs are not modified, but this is not checked by the control logic. The corresponding timing assumption appears in the fork present in the load of the function block. In fact, the Req signal drives the precharge transistors P1 and P2 but also transistors T1 and T'1 which isolate the internal nodes from the output. The control logic only checks that the precharge is performed (sensing the completion signal) but does not guarantee that the outputs are not altered. For the cell to work properly, transistors T1 and T'1 have to turn off before transistors P1 and P2 turn on. Fortunately this problem is easily solved in the load by sizing the transistors properly. Moreover, with a standard cell based approach, such as the one described in section V, the problem introduced by this fork is local to the cell, and is solved independently of the way cells are connected to each other.
Finally the control part of one stage is simply designed by combining the conditions for a stage to precharge and evaluate. In effect, we changed the communication protocol controlling the exchange of data between stages. We removed the spacers from the communication protocol, keeping their existence only inside the cells where they are needed in order to check the state of the cell and generate completion signals. As we said before, spacers were not propagated by the cells, but simply generated. They were not really used by the control part of the cells, so it made sense to remove them as we did.
As a result we introduced a new timing assumption during the evaluation phase which is known as the "bundled data constraint". The evaluation must be triggered after the data inputs have arrived at a given cell. If we compare the paths taken by the data and the control signal generating Req, it can be seen that this timing assumption is not hard to respect. In fact, function block i first issues an acknowledge signal from its internal nodes by gating them. This acknowledge signal then goes through the Muller C element of stage i+1 before driving the control input of function block i+1. On the other hand, data are generated from function block i through the dynamic latch which is very fast, and they then directly drive the inputs of function block i+1. Hence, the data path should easily be faster than the control path. This simple analysis assumes that the delays introduced by the wires used to convey data and control from stage i to stage i+1 are identical, which is likely at this circuit level.
As a conclusion, respecting this timing assumption is easy and, importantly, it can be respected independently of the computation time. We then end up with a communication protocol using only two phases to exchange data and four phases for the synchronisation.
The behaviour of the proposed pipeline scheme is now very similar to the behaviour of the pipeline schemes built with latches, except that no explicit latch is used and that data are always valid at the output of the cells.
D. Performance analysis
It must first be noted that the control circuit is very simple and does not introduce any extra delay when compared to previously proposed pipeline schemes. The pipeline scheme has been modified at the cost of only two extra transistors added in the load cells.
We hereafter use Williams' theory to compute the forward latency and the cycle time in order to compare the new pipeline characteristics with the previous ones.
The latency of the new pipeline scheme is identical to the latency of the classical pipeline scheme using precharged function blocks and presented in Figure II.4 (denoted PC0 in Williams' thesis). Its value is : max (tF↑ + tD↑ + tC↑, tF↓ + tD↓ + tC↓), where "tF" denotes the function computation time, "tD" the completion signal generation time and "tC" the control delay time.
On the other hand, the cycle time of the new pipeline scheme is reduced when compared to the previous ones. Its expression is the following : P = (tC↑ + tC↓) + 2 max (tC↑ + tF↑ + tD↑, tC↓ + tF↓ + tD↓). This is due to the fact that only two stages instead of three are involved in a cycle. In fact, two consecutive cells are now allowed to be active at the same time (one evaluating and the other one precharging), which was not the case in the previously proposed pipeline. The control no longer checks the state of a third stage to start a computation or a precharge.
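As a worked illustration of the two expressions above, the following Python sketch evaluates the latency and the cycle time for a set of delays. The class `Delays`, the function names and the numerical values (in picoseconds) are hypothetical and chosen only to show the arithmetic; they are not measurements from the test chip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delays:
    """Rising and falling delays of one stage (hypothetical values, in ps)."""
    tF_up: int    # function computation time (evaluation)
    tF_down: int  # function precharge time
    tD_up: int    # completion signal generation, rising
    tD_down: int  # completion signal generation, falling
    tC_up: int    # control delay, rising
    tC_down: int  # control delay, falling


def latency(d: Delays) -> int:
    """max(tF_up + tD_up + tC_up, tF_down + tD_down + tC_down)."""
    return max(d.tF_up + d.tD_up + d.tC_up,
               d.tF_down + d.tD_down + d.tC_down)


def cycle_time(d: Delays) -> int:
    """P = (tC_up + tC_down) + 2 * max(tC + tF + tD for each edge)."""
    return (d.tC_up + d.tC_down) + 2 * latency(d)


example = Delays(tF_up=1000, tF_down=800, tD_up=300, tD_down=300,
                 tC_up=200, tC_down=200)
print(latency(example), cycle_time(example))
```

Note that the maximum appearing in the cycle time is the same sum of terms as the latency, so in this sketch P equals the two control delays plus twice the latency.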
Finally, performance can potentially be improved because a shorter cycle time may improve the throughput. Moreover, the higher storage capability of the cells allows the design of two pipeline-stage circuits, whereas a minimum of three was required before. This property is exploited in designing self-timed rings as described in the next section.
IV. Pipelining Self-timed rings
As stated before, this new pipeline scheme takes better advantage of the storage capability of the precharged logic. Consequently, it allows the design of more compact structures. In order to clearly demonstrate this benefit, we illustrate the use of the new pipeline scheme in the design of self-timed rings. In fact, self-timed ring behaviour, complexity and performance are strongly related to the pipeline scheme used [18]. Hence, we hereafter use self-timed rings as a relevant way to demonstrate the efficiency of the pipeline scheme proposed.
A. Principle
Self-timed rings are a very attractive class of asynchronous architecture. At first sight they are best adapted to implementing iterative algorithms. However, they rely on the basic idea of adjusting the latency and the throughput in an asynchronous pipeline, which can be applied to a large extent to the design of most pipelined asynchronous circuits [27].
A self-timed ring is first an efficient way to design very compact circuits, involving minimum hardware. Ted Williams' theory on rings showed how to make them as fast as possible with the minimum hardware [28]. This makes this architectural class well suited to the design of complex chips involving a high number of transistors, with sub-chips running at moderate speeds. Moreover, in complex chips, self-timed rings concentrate high speed signals locally. A self-timed ring can be seen as a standard asynchronous sub-circuit which communicates with the environment using a standard protocol. The communication speed is slow when compared to the inside activities of the ring. This may lead to a significant reduction of design complexity and time [29]. It is an excellent illustration of the benefits of the local controllability of asynchronous circuits. Finally, at a functional level, self-timed rings can take advantage of the early completion properties of the algorithms. It is often possible, without a significant increase of hardware complexity, to compute stopping conditions on line, in order to interrupt the computation as soon as the result is available. This is an elegant way to decrease the average computation time and also to decrease the average power consumption.
B. DCVSL self-timed rings
When classical DCVSL cells are used to implement the stages of a ring, there is no explicit latch or registers between the stages. The data are partially stored in the DCVSL cells.
As explained before, three stages are required: one stage is precharging, another one is holding the data for the last one to compute. It is then obvious that self-timed rings built with such cells require at least three stages to be able to compute. In the ring context, two stages are storing a token, spacer plus data, and one stage holds a bubble. With Latched DCVSL cells, one stage is now able to store a full token and another one holds the bubble.
Hence, two stages are enough for a ring to work properly. This is the first advantage brought by the new pipeline scheme to the design of self-timed rings. The minimum number of stages required for the rings to work is reduced, and more compact structures can then be designed.
D. Optimized Latched DCVSL self-timed rings
We now look at the benefits of the pipeline scheme proposed when designing optimized self-timed rings as defined by Williams [28]. We recall that the goal of the optimization is to obtain the best throughput for the ring with the minimum hardware, i.e. the minimum number of stages.
We here consider self-timed rings processing a single token, and in this case the optimum number of stages is given by ONS = P/Lf, with P the cycle time and Lf the forward latency of a stage. As shown before, the cycle time P is reduced with the new pipeline scheme and hence the optimum number of stages is decreased. This improvement is demonstrated in section VI with the design of a self-timed ring divider.
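The relation ONS = P/Lf can be evaluated directly. The short Python sketch below does so for hypothetical values of P and Lf; rounding the ratio up to an integer and enforcing the two-stage minimum of Latched DCVSL rings are assumptions made here for illustration, and the function names are not taken from the paper.

```python
import math


def optimum_number_of_stages(cycle_time: float, forward_latency: float) -> float:
    """ONS = P / Lf for a ring processing a single token."""
    return cycle_time / forward_latency


def stages_to_build(cycle_time: float, forward_latency: float,
                    minimum_stages: int = 2) -> int:
    """Integer stage count (assumption: round up, never below the minimum)."""
    ons = optimum_number_of_stages(cycle_time, forward_latency)
    return max(minimum_stages, math.ceil(ons))


# Hypothetical delays in picoseconds, for illustration only.
print(optimum_number_of_stages(3400, 1500))
print(stages_to_build(3400, 1500))
```

A shorter cycle time P for the same forward latency Lf lowers the ratio, which is how the reduced cycle time translates into fewer stages.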
V. A standard cell based approach to designing self-timed circuits
Before demonstrating the benefits of the pipeline scheme with the design of a test chip, we describe the design methodology we follow. In order to be able to rapidly design self-timed circuits based on precharged function blocks (using either DCVSL or LDCVSL), we decided to specify and design an asynchronous standard cell library. The main motivation was to design precharged function blocks as standard cells so that they can easily be associated with standard CMOS gates in a design. In fact, our goal was to use precharged function blocks to implement operative functions and to use standard CMOS gates to implement the controllers. To do so, the precharged function blocks have been designed as standard cells respecting the format of the cells of an available library. We made the choice of the SGS-CNET three metal layer 0.5 µm CMOS technology, for which a standard cell library is available and maintained. The next sub-sections describe the structure of the precharged function standard cells and the library environment.
A. Precharged function standard cells
As presented before, a precharged function block is built out of two parts, a functional tree and a load. As illustrated before, the role played by the load is essential with respect to the storage capability of the cell, whereas the role played by the functional tree is essential with respect to the computation the cell is required to perform.
We then decided to separate the design of the loads and the functional trees in order to provide the maximum flexibility. In fact, load and tree functions can be identified independently first and then associated to obtain the final cell. Two sets of standard cells were designed.
The load set. We designed two different loads, a standard pseudo-static load, as currently used in DCVS Logic cells, and a load with latch, as described in Figure III.1.
The functional tree set. We designed several trees performing standard functions such as "and2" (nand2), "or2" (nor2), "xor2" (nxor2), multiplexers, register, full adder carry computation and full adder sum computation. Note that the complement of a logic function is easily obtained by exchanging the output wires and does not require the design of extra cells. We also added the cells needed for the design of the divider presented in section VI.
As well as these two sets of cells we also designed some cells very often used in the control parts of asynchronous circuits : Muller C element, generalized Muller C element, mutual exclusive element, QFlop, SR latch.
As the completion signal generation is a critical issue when designing self-timed circuits, care must be taken in the way it is implemented at the layout level. To strongly merge the completion signal generation into the cells, we designed a specific nand cell which can be inserted between a load and a functional tree when needed. Figure V.2 shows how a precharged functional block with a completion signal generation is designed. This strategy ensures that the completion signal is generated locally in the cell and hence that minimum parasitic capacitance is introduced. In fact, allowing the router to place the nand gate far away from the internal nodes could at least lead to slower circuits and even to malfunction.
B. The standard cell library and its environment
At this point we have three sets of cells, comprising about 30 different individual cells, dedicated to the design of asynchronous circuits. This is in addition to the CMOS standard cells available through the foundry. The two specific sets of cells, loads and trees, constitute a first level of our library which is not directly usable by the designers. Before designing a given circuit, the designer has to build a second level library made of precharged function blocks that can be built by associating loads and trees. Associating loads and trees requires that some precautions be taken in order to obtain good results in terms of speed and power consumption. In fact, the transistor sizes of the loads and trees need to be matched.
In order to do so we set up an optimization procedure that gives the transistor sizes of the load and the tree in order to obtain the maximum speed for the minimum power consumption. This procedure is described in [24] . As the number of cells has to be kept as small as possible, only a small number of loads with different transistor sizes has been designed. The transistor size choices have resulted from extensive simulations and optimizations performed on the basic set of cells.
To illustrate the design strategy, consider the case of adding a new functional tree cell to the library. The tree structure is first designed using a standard procedure, for instance the tabular procedure proposed in [30]. The transistor sizes are then optimized, both for the tree and the load. This can actually be done with a standard load or a load with latch, since the load structure does not influence the optimization procedure. From this optimization, a load transistor size is attached to the tree, which specifies the optimized transistor size for the load to obtain the minimum power consumption. The designer then chooses the load with the nearest transistor size available in the library and associates it with the functional tree designed. The diagram presented in Figure V.3 describes this flow.
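The last step of this flow, choosing the library load closest to the optimized size, can be sketched in a few lines of Python. The function name `pick_load`, the use of a single width per load and the widths listed are hypothetical; the actual library may characterise loads by several transistor sizes.

```python
from typing import Sequence


def pick_load(optimal_width_um: float, library_widths_um: Sequence[float]) -> float:
    """Return the available load width nearest to the optimized width."""
    return min(library_widths_um, key=lambda w: abs(w - optimal_width_um))


# Hypothetical library of load widths (in micrometres) and optimizer result.
available = [2.0, 3.0, 4.0, 6.0]
print(pick_load(3.4, available))
```

Keeping only a few load sizes in the library, as described above, trades a small mismatch against the optimum for a much smaller number of cells to design and maintain.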
Two tools have been designed to assist designers in creating their own cells: the tree generation tool, which is based on the tabular procedure and the Quine-McCluskey approach [30], and the transistor sizing procedure [24]. Neither of these procedures is fully automatic; both need the assistance of the designer.
All the cells present so far in the library have been laid out by hand. No tool has been designed to perform an automatic layout generation. It also means that any extra cell that has to be added to the library must be designed by hand by the designers. 
C. The design flow
To design a complete asynchronous circuit using the library we developed, we proceed as follows. The operative parts are specified at a schematic level using precharged function blocks based on the DCVS Logic or LDCVS Logic. The control parts are specified using Signal Transition Graphs. They are then synthesized using a classical method [31] [32].
Simple controllers, such as the ones involved in self-timed rings, are directly specified at the schematic level.
The circuit parts are then entered at a schematic level in the Cadence framework. The asynchronous cells not available in our library have to be designed as explained above. The circuit can then be simulated in the Cadence framework at a logic or electrical level. As soon as the circuit is valid it can be routed using the standard tools available in Cadence. The layout obtained is a mixture of asynchronous and CMOS standard cells. Post layout simulations can be performed to check timing assumptions and speed, for example. Figure V.4 presents the whole design flow.
This standard cell approach to the design of self-timed circuits has been validated through the design of several chips that have been fabricated and tested. The fast parallel-parallel multiplier reported in [24] is one example, as is the self-timed ring divider presented in section VI [34]. An asynchronous processor array is also currently being designed [29].
VI. Application to the design of a self-timed divider and square root extractor
A. Division algorithm and divider structure
The digit-recurrence approach of the division Q = A ÷ D relies on subtraction and on multiplication by the radix b and by the quotient digit $q_j$.
Step 0 (initialisation) $R^{(0)} = A - D$; $Q^{(0)} = 1$
Step j (iteration) $1 \le j \le n$: $R^{(j)} = R^{(j-1)} - q_j \cdot D \cdot b^{-j}$; $Q^{(j)} = Q^{(j-1)} + q_j \cdot b^{-j}$
It is easy to verify that $A = Q^{(j)} \cdot D + R^{(j)}\ \forall j$. Each $q_j$ must be chosen to make $R^{(j)}$ converge to 0. In the present division, we use the radix 2 Signed Digit (SBD) number representation [36], more precisely the Borrow-Save (BS) variant where each digit value is the difference of two bits. We now describe the notation and variables used in this section :
To avoid a full comparison, the quotient digit $q_j$ is selected according to the sign of $R^{(j-1)}$ only [37, 38].
Case 2 : $R^{(j-1)} < 0 \Rightarrow -2^{-j+1} \cdot D \le R^{(j-1)} < 0$ and $R^{(j)} = R^{(j-1)} + D \cdot 2^{-j}$
Case 3 : $R^{(j-1)} = 0 \Rightarrow -2^{-j+1} \cdot D \le -2^{-j+1} < R^{(j-1)} < 2^{-j+1} \le 2^{-j+1} \cdot D$ since $1 \le D < 2$, and $R^{(j)} = R^{(j-1)}$.
Each case leads to: $-2^{-j} \cdot D \le R^{(j)} \le 2^{-j} \cdot D$, and the convergence of the remainder towards 0 is proved. This proof is sometimes called the arithmetic condition.
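The recurrence and its convergence bound can be checked numerically. The Python sketch below is a simplified illustration, not the hardware algorithm: it works in radix 2 with exact rationals, takes $R^{(0)} = A - D$ and $Q^{(0)} = 1$ so that $A = Q^{(j)} \cdot D + R^{(j)}$ holds from the start, and selects each digit from the exact sign of the remainder, whereas the circuit inspects a borrow-save remainder. The function name `divide` is hypothetical.

```python
from fractions import Fraction
from typing import List, Tuple


def divide(a: Fraction, d: Fraction, n: int) -> Tuple[Fraction, Fraction, List[int]]:
    """Radix-2 digit recurrence; returns (Q, R, digits q_0..q_n)."""
    q = Fraction(1)          # Q(0) = 1, i.e. q_0 = 1
    r = a - d                # R(0) = A - D keeps A = Q*D + R
    digits = [1]
    for j in range(1, n + 1):
        qj = (r > 0) - (r < 0)            # sign of R(j-1): -1, 0 or +1
        weight = Fraction(1, 2 ** j)      # 2^-j
        r -= qj * d * weight              # R(j) = R(j-1) - q_j * D * 2^-j
        q += qj * weight                  # Q(j) = Q(j-1) + q_j * 2^-j
        digits.append(qj)
    return q, r, digits


a, d, n = Fraction(3, 2), Fraction(5, 4), 16
q, r, digits = divide(a, d, n)
print(float(q), float(r))
```

With operands in [1, 2), the invariant `a == q * d + r` holds exactly at the end, and the final remainder satisfies the bound derived above, so the quotient error is at most $2^{-n}$.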
C. Equations of the Head and Tail cells
… $\cdot\ 2^{-i} + 2^{-n}$ because D is in two's complement notation. So the iteration $R^{(j)} = R^{(j-1)} - q_j \cdot D \cdot 2^{-j}$ becomes:
The head cell preserves the identity: Ŝ (j) = R (j-1) + q 
{only digits with weight $2^0$}
Most of the previous implementations of self-timed division, or of division and square root extraction, rely on three overlapped quotient selection stages [41, 42, 43]. This choice of several quotient selection stages results in far more complex head equations.
D. Range of the remainder
Since $\hat{S}^{(j)} = 2 \cdot (s$ …
The logic equations for the head are consistent with the arithmetic operation that the cell has to perform. It is worth noting that while $1 \le D < 2$, since implicitly we take $d_0 = 1$ in the head cells, there is no condition on A. $A \ge 0$ since it is in unsigned binary notation. The fact that $q_0$ (or $Q^{(0)}$) = 1 only implies that $Q^{(j)} \ge 2^{-j}$, since $Q^{(j)}$ is in redundant notation.
In the IEEE-754 mantissa application, $1 \le A < 2$ and $1 \le D < 2$ gives $0.5 < Q^{(j)} < 2$.
E. On-the-fly conversion
The quotient $Q^{(n)}$ is produced in the form $\sum_{i=0}^{n} q_i^{+} \cdot 2^{-i} - \sum_{i=0}^{n} q_i^{-} \cdot 2^{-i}$. The conversion to conventional notation can be made by a subtraction after the completion of the division. On-the-fly conversion of the result presents three advantages: no extra conversion time, no subsequent step to link, and a largely simplified square root extraction algorithm [44]. But on-the-fly conversion costs a little more hardware than a simple subtraction.
At step 0 let U (0) = Q (0) = 1 and V (0) = 0 .
Suppose that at step j-1 we have : (j) and V (j) are updated from U (j-1) and V (j-1) according to q j , see Table VI .2. Table VI .2 Now at step j let Q (j) = Q (j-1) + q j * 2 -j . It is easy to verify by checking the table VI.2 for each of the three values of q j that: U (j) = Q (j) and V (j) = U (j) -2 -j . Developing the table VI.2 down to the logic level and introducing LSB (j) = 2 -j , we finally obtain : lsb 
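The induction on $U^{(j)}$ and $V^{(j)}$ can be illustrated with a short sketch. Table VI.2 itself is not reproduced in this text, so the three update rules below are an assumption: they are the rules implied by the two invariants $U^{(j)} = Q^{(j)}$ and $V^{(j)} = U^{(j)} - 2^{-j}$. The function name `on_the_fly` is hypothetical, and exact rationals stand in for the converter registers.

```python
from fractions import Fraction


def on_the_fly(digits):
    """On-the-fly conversion sketch; digits[0] is q_0 = 1, digits[j] in {-1, 0, 1}.

    Returns (U, V) where U is the quotient in conventional notation.
    """
    U, V = Fraction(1), Fraction(0)   # step 0: U(0) = Q(0) = 1, V(0) = 0
    Q = Fraction(1)                   # reference value, used only for checking
    for j, q in enumerate(digits[1:], start=1):
        lsb = Fraction(1, 2**j)       # LSB(j) = 2^-j
        if q == 1:
            U, V = U + lsb, U         # assumed rule for q_j = +1
        elif q == 0:
            U, V = U, V + lsb         # assumed rule for q_j = 0
        else:
            U, V = V + lsb, V         # assumed rule for q_j = -1
        Q += q * lsb                  # Q(j) = Q(j-1) + q_j * 2^-j
        assert U == Q and V == U - lsb
    return U, V
```

For the digit string `[1, -1, 0, 1, -1]` the quotient is $1 - 1/2 + 1/8 - 1/16 = 9/16$, which is the value returned in `U`. Each rule only selects one of the two previous values and sets the new least significant bit, so no subtraction is needed at the end.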
F. Square root algorithm
The principle of square root extraction does not differ much from that of division [45, 46]. It only supposes that the divisor is made identical to the quotient: $Q = A \div Q$.
The digit-recurrence square root approach relies on subtraction and multiplication by the radix $b$ and by the quotient digits $q_j$.
Step 0 (initialisation): $R^{(0)} = A - 1$; $Q^{(0)} = 1$
Step $j$ (iteration), $1 \le j \le n$: $R^{(j)} = R^{(j-1)} - 2 \cdot q_j \cdot Q^{(j-1)} \cdot b^{-j} - (q_j)^2 \cdot b^{-2j}$; $Q^{(j)} = Q^{(j-1)} + q_j \cdot b^{-j}$
It is easy to verify that $A = (Q^{(j)})^2 + R^{(j)}$, since $(Q^{(j)})^2 = (Q^{(j-1)})^2 + 2 \cdot q_j \cdot Q^{(j-1)} \cdot b^{-j} + (q_j)^2 \cdot b^{-2j}$ and $A = (Q^{(0)})^2 + R^{(0)}$. The strong similarity between the square root extraction and the division algorithms leads to an implementation using the same head cell as for the division. In radix 2 the iteration becomes: $R^{(j)} = R^{(j-1)} - 2 \cdot q_j \cdot Q^{(j-1)} \cdot 2^{-j} - (q_j)^2 \cdot 2^{-2j}$. Since $q$ is in borrow-save we have $(q_j)^2 = |q_j|$ … the bits of $LSB^{(j+1)} = 2^{-j-1}$. We obtain:
In division we needed $d_0 = 1$, i.e. $D \ge 1$, for the head cell equations. For the same reason, in this part we require that both $U^{(j)}$ and $V^{(j)}$ be $\ge 1$. This is always true for $U^{(j)}$ since …
In fact $V^{(j)}$ is used only when $q_j^{-} = 1$; otherwise it is ignored. So we need $V^{(j)} \ge 1$ whenever $q_j = -1$. Since $A \ge 1$, $q_0 = 1$. The next time that $q_j \ne 0$, with $j > 0$, $q_j$ will again be equal to 1, and $V^{(j+1)}$ will get the value $U^{(j)} \ge 1$. Incidentally, this also means that the initialisation $V^{(0)} = 0$ is not necessary.
H. Borrow-save square root extraction convergence
Remember that $R^{(j-1)} = 4 \cdot (s$ …
Suppose that at step $j-1$ we have $-2^{-j-1} \cdot Q^{(j-1)} + 2^{-2j-2} \le R^{(j-1)} \le 2^{-j-1} \cdot Q^{(j-1)} + 2^{-2j-2}$; then 3 cases are possible for the next algorithm step:
… $R^{(j)} = R^{(j-1)}$
Each case leads to $-2^{-j} \cdot Q^{(j)} + 2^{-2j} \le R^{(j)} \le 2^{-j} \cdot Q^{(j)} + 2^{-2j}$ and, since $Q^{(j-1)} \le 2 - 2^{-j-1}$, the convergence of the remainder $R^{(j)}$ towards 0 is proved.
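The radix-2 square root recurrence of Section F can be exercised in the same spirit. This is a sketch under stated assumptions rather than the borrow-save implementation: it uses exact rational arithmetic, selects $q_j$ from the exact sign of $R^{(j-1)}$, and assumes $1 \le A < 4$ so that the root lies in $[1, 2)$. The function name `sd_sqrt` is hypothetical. The assertion checks the identity $A = (Q^{(j)})^2 + R^{(j)}$ at each step.

```python
from fractions import Fraction


def sd_sqrt(A, n):
    """Radix-2 digit-recurrence square root sketch (exact arithmetic, assumed model).

    Returns (Q, R) with A == Q*Q + R after n steps. Assumes 1 <= A < 4.
    """
    A = Fraction(A)
    Q = Fraction(1)                 # step 0: Q(0) = 1
    R = A - 1                       # step 0: R(0) = A - 1
    for j in range(1, n + 1):
        q = (R > 0) - (R < 0)       # q_j in {-1, 0, 1} from the sign of R(j-1)
        w = Fraction(1, 2**j)       # 2^-j
        # R(j) = R(j-1) - 2*q_j*Q(j-1)*2^-j - (q_j)^2 * 2^-2j
        R -= 2 * q * Q * w + q * q * w * w
        Q += q * w                  # Q(j) = Q(j-1) + q_j * 2^-j
        assert A == Q * Q + R       # invariant of the recurrence
    return Q, R
```

Under these assumptions, `sd_sqrt(2, 24)` returns a `Q` within $2^{-24}$ of $\sqrt{2}$, which can be confirmed without floating point by checking that 2 lies between $(Q - 2^{-24})^2$ and $(Q + 2^{-24})^2$.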
I. Range of the remainder
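Cell signals: $u_i^{(j-1)}$, $v_i^{(j-1)}$, $d_i$, $lsb_i^{(j-1)}$, $t_i^{(j-1)}$, $s_i^{(j-1)}$, $u^{(j)}$, $v^{(j)}$, $d_i$, $lsb$.
$u_i^{(j)} = q_j^{-} \cdot v_i^{(j-1)} \lor \overline{q_j^{-}} \cdot u_i^{(j-1)} \lor lsb_{i-1}^{(j-1)} \cdot (q_j^{+} \lor q_j^{-})$;
$v_i^{(j)} = q_j^{+} \cdot u_i^{(j-1)} \lor \overline{q_j^{+}} \cdot v_i^{(j-1)} \lor lsb_{i-1}^{(j-1)} \cdot \overline{q_j^{+}} \cdot \overline{q_j^{-}}$;
$lsb_i^{(j)} = lsb_{i-1}^{(j-1)}$;
$t_i^{(j)} = s_i^{(j-1)} \oplus t_i^{(j-1)} \oplus \left( q_j^{-} \cdot (d_i \lor m \cdot (u_i^{(j)} \lor lsb_{i-2}^{(j-1)})) \lor q_j^{+} \cdot (\overline{d_i} \lor \overline{m \cdot (u_i^{(j-1)} \lor lsb_{i-2}^{(j-1)})}) \right)$;
$s_{i-1}^{(j)} = \mathrm{majority}\left[ s_i^{(j-1)},\ \overline{t_i^{(j-1)}},\ \left( q_j^{-} \cdot (d_i \lor m \cdot (u_i^{(j)} \lor lsb_{i-2}^{(j-1)})) \lor q_j^{+} \cdot (\overline{d_i} \lor \overline{m \cdot (u_i^{(j-1)} \lor lsb_{i-2}^{(j-1)})}) \right) \right]$.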
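The first three equations ($u$, $v$ and $lsb$) can be read as a bit-sliced form of the on-the-fly converter. The sketch below evaluates them on bit vectors and is meant only as an illustration. It assumes that bit $i$ has weight $2^{-i}$, that $lsb$ is a one-hot marker starting at bit 0, and that each quotient digit is encoded with at most one of $q_j^{+}$, $q_j^{-}$ set. The helper names `convert_bits` and `value` are hypothetical.

```python
from fractions import Fraction


def convert_bits(digits, n):
    """Bit-level evaluation of the u, v and lsb equations (assumed reading).

    digits[0] is q_0 = 1; bit index i (0..n) carries weight 2^-i.
    Requires len(digits) <= n + 1.
    """
    u = [1] + [0] * n                  # U(0) = 1
    v = [0] * (n + 1)                  # V(0) = 0
    lsb = [1] + [0] * n                # one-hot marker for LSB(0) = 2^0
    for q in digits[1:]:
        qp, qm = int(q == 1), int(q == -1)   # assumed encoding of q_j
        prev = [0] + lsb[:-1]                # lsb(j-1) read at index i-1
        u, v = (
            [(qm & v[i]) | ((1 - qm) & u[i]) | (prev[i] & (qp | qm))
             for i in range(n + 1)],
            [(qp & u[i]) | ((1 - qp) & v[i]) | (prev[i] & (1 - qp) & (1 - qm))
             for i in range(n + 1)],
        )
        lsb = prev                           # lsb(j)_i = lsb(j-1)_{i-1}
    return u, v


def value(bits):
    """Rational value of a bit vector whose bit i has weight 2^-i."""
    return sum(Fraction(b, 2**i) for i, b in enumerate(bits))
```

For the digits `[1, -1, 0, 1, -1]` and `n = 4`, `value(u)` equals $9/16$, the same result as accumulating the signed digits directly.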
K. Circuit architecture
To demonstrate the benefits of the pipeline scheme described when designing rings, and to validate our standard cell approach, a simplified version of the algorithm performing only the division on 32-bit numbers has been laid out and fabricated. The circuit is a self-timed ring assembled according to the theory proposed by Williams [28]. Following his methodology, an elementary stage was first designed using LDCVSL cells so that the forward latency Lf is minimum. An elementary stage with its local controller is illustrated in 
K.1. Timing assumptions
As shown in Figure VI.4, the acknowledge signal Ack j is issued from the first tail cell.
Sensing only the output of the first tail cell relies on the assumption that the Ack j signal is active only after the outputs of all the tail cells are valid. This assumption is realistic since the Ack j signal is generated at the end of the critical path and takes some time to be generated (a NAND gate). Moreover, its effect on the next stage is delayed by a Muller C element (Figures VI.4 and VI.6).
K.2. Ring structure optimization.
From electrical simulations of a netlist extracted from the layout of a single stage (Figure VI.5), the values of "P" and "Lf" defined in [28] were computed. Each stage of the circuit has a forward latency of about 2.5 ns. The cycle time is about 7 ns. The optimum number of stages to be used in the ring is then easily derived by computing P over Lf. For this circuit, the optimum number is three. We applied exactly the same procedure to the design of the same ring divider, but based on DCVSL cells. As expected, the forward latency was found to be the same, but the cycle time was about 8.5 ns. The ring would then have required one more stage to compute at the same speed. This corresponds to a saving of one fourth of the ring complexity and clearly demonstrates the advantage of using Latched DCVSL cells.
K.3. Self-timed ring structure
Thus, the self-timed ring was built by assembling three stages (Figure VI.6). Self-synchronisation in the ring is obtained through three looped Muller C elements (Figure VI.6), as described in section III.3.
The three-stage structure in conjunction with this simple control scheme ensures that the best possible throughput is achieved because under these conditions the ring operates with a single token in the Data-Limited region [28]. The token always flows forward into an unoccupied stage, i.e. a stage that is holding a bubble.
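The stage counts quoted in K.2 can be reproduced with a few lines of arithmetic. The sketch below assumes that "P" is the measured cycle time and that the ratio P/Lf is rounded up to the next integer, which matches the figures reported in the text (three stages for Latched DCVSL, four for DCVSL). The function name `optimum_stages` is hypothetical.

```python
import math


def optimum_stages(cycle_time_ns, forward_latency_ns):
    """Number of ring stages, taken here as ceil(P / Lf) (assumed rounding)."""
    return math.ceil(cycle_time_ns / forward_latency_ns)


ldcvsl_stages = optimum_stages(7.0, 2.5)   # Latched DCVSL: P about 7 ns
dcvsl_stages = optimum_stages(8.5, 2.5)    # DCVSL: P about 8.5 ns

# Fraction of the ring saved by the latched cells.
saving = 1 - ldcvsl_stages / dcvsl_stages
```

With the reported values, 7 / 2.5 = 2.8 and 8.5 / 2.5 = 3.4, so the two rings need three and four stages respectively, and the saving is one fourth.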
K.4. Interface controller
An interface controller has to be added to the ring to cleanly initiate and stop the computation. Signal Req in tells the divider that new operands are ready, and Ack out signals the completion of the division (Figure VI.6). The interface controller CTRL was first modelled using a Signal Transition Graph, and then synthesized following the method presented in [32]. Finally, it was verified that the logic implementation of the controller was hazard-free.
When Req in is active, the controller allows the operand to be inserted in the ring via the input multiplexers, and then starts the computation using the Req en signal. The circuit is 
K.5. Completion detection
The end of the computation is detected by sensing the LSB of the on-the-fly converter. The converter, which both converts the digits coming out of the stages and memorizes the converted values, is initialized with empty values when the computation starts. Thus, as soon as the least significant digit is filled with valid data, meaning the full result has been converted, the computation must be stopped. An internal signal is used to stop the ring self-synchronisation, and the Ack out signal tells the outside world that the result is available.
Since at any time the partial result U (j) is correct, an early stop of the operation by a null partial remainder R (j) (i.e. S (j) = T (j) ) is possible with this architecture.
L. The chip fabricated
The layout of the 32-bit self-timed ring divider is presented in Figure VI
So far, the methodology presented requires the circuits to be specified at a schematic level. Some work still has to be done to link the library, and even the sublibraries, with a high-level description language. Moreover, the methodology adopted is based on two different chains of tools to design the operative and the control parts, which makes a synthesis tool more difficult to design. Fortunately, asynchronous circuits are very modular and easy to build by connecting small parts together. Hence, such a methodology can still be applied to the design of large chips, provided that modularity is clearly identified and small subcircuits are defined. However, given the progress that is continually being made in the areas of software engineering, and the computing power generally available, a parallel CAD activity might be expected to be very fruitful.
