This paper presents a general method for designing delay-insensitive datapath circuits. Its emphasis is on the formal derivation of a circuit from its specification. We discuss the properties required in a code that is used to transmit data asynchronously, and we introduce such a code. We introduce a general method in the form of a theorem for distributing the evaluation of a function over a number of concurrent cells. This method requires that the code be "distributive." We apply the method to the familiar example of a ripple-carry adder, and we give a CMOS implementation of the adder.
Introduction
A circuit is said to be delay-insensitive when its correct operation is independent of the delays in the operators and in the wires, except that these delays are positive and finite. Obviously, such circuits do not use clocks for the sequencing of actions, and are therefore a special class of asynchronous circuits. Delay-insensitive circuits are interesting for two main reasons: First, they are more robust and potentially faster than their clocked counterparts, since their correct operation does not rely on worst-case delay assumptions. The speed advantage will be clearly demonstrated by the ripple-carry adder example, where we exploit the variation in carry-chain lengths to reduce an algorithm that is linear in the worst-case assumption to an algorithm that is logarithmic in the average case.
Second, delay-insensitive circuits are more suitable for formal treatment since they can be designed and analyzed entirely within the algorithmic domain, up to electrical optimizations like transistor sizing. A delay-insensitive circuit can be formally derived by program transformation from a "high-level" program description. If the original program has been proven correct, the resulting circuit will be correct by construction. For a description of the method, see, for instance, [4] and [5].
In spite of the intense activity in the area of high-level synthesis of delay-insensitive circuits, most published research so far has concentrated on the design of "control circuits," i.e., circuits that realize the sequencing of actions of a computation. The other type of circuits, called "data paths," are those that deal with the manipulation and transmission of data.
Datapath design raises issues very different from, and in several respects more difficult than, those of control circuitry. First, for reasons of efficiency, all sequencing circuitry should be eliminated from the datapath implementation. Second, the evaluation of a function should be distributed. Ideally, we want each bit of the output to be produced by a "cell" that depends only on a limited number of bits of input. All cells operate concurrently.
This paper presents a general method for designing delay-insensitive datapath circuits. Its emphasis is on the formal derivation of a circuit from its specification. We first discuss the properties required in codes used to transmit data asynchronously between two concurrent processes, and we introduce one of these codes. We then introduce a general method in the form of a theorem for distributing the evaluation of a function over a number of concurrent cells; this method requires that the code be "distributive."
Next, we apply the method to the familiar example of a ripple-carry adder. This example uncovers another difficulty of datapath design: In order to reduce the fan-in of each cell, some information computed by one cell is used in another cell (the carry, in the case of an adder). But this extra communication may reduce the concurrency between cells.
We therefore introduce and apply optimization rules that reduce the dependencies between input and output.
Finally, we show how a monotonicity property of guard evaluation called "stability" makes it possible to implement the final program directly as a transistor network in CMOS. This mapping is particularly efficient since, unlike earlier implementations of the adder, it does not require translation to standard cells.
The paper is reasonably self-contained: The whole design of the adder from program to CMOS circuit is explained and justified.
Delay-insensitive Communication
Consider a system consisting of two communicating processes: a producer (sender) of data words, and a consumer (receiver) of the data words. The data words are binary encoded and transmitted on a set of wires. For the purpose of making this paper self-contained, we view a wire shared by two processes as being a boolean variable assigned by one process and read by the other process. There is an important restriction, however, to the use of wires as program variables: Because of the delay-insensitive nature of the transmission of data, the order in which the wires of a set are assigned by the sender cannot be maintained on the receiver side: they can be observed by the receiver to change value in any order.
Because signals (assignments to wires) cannot be ordered, it is impossible to use an extra signal (a clock or control signal) to encode the information, to be used by the receiver, that the set of data wires contains a valid value. Instead, this information has to be encoded in the data that is transmitted between sender and receiver.
Delay-insensitive Codes
Let us discuss first the transmission of one data word. The sender assigns values to all the wires concurrently, since order is irrelevant. The receiver reads the data wires in any order or concurrently. Concurrent reading and writing of a wire is possible: We may assume without loss of generality that the value read is either the old or the new value. Concurrent writes are not allowed.
Let $B$ be the set of data words to be transmitted. A data value to be transmitted is encoded using the coding function $C : B \mapsto X$. Set $X$ is the set of all code words. Let $V$ be the set $C(B)$. $V$ is called the valid set, or the set of valid values.
The code has to be chosen such that there is a non-empty set, $N$ (the neutral set, or the set of neutral values), such that $N \subseteq X \setminus V$.
Hence, a code value cannot be both neutral and valid. For a code word $X$, the predicate $v(X)$ stands for "$X$ is a valid code word." The predicate $n(X)$ stands for "$X$ is a neutral code word." The code has to be chosen such that:
Property 1. For any code word $X$: $\neg v(X) \lor \neg n(X)$. Furthermore, $|X| > |B|$. Typically, each data word is an array of $n$ booleans, and each code word is an array of $m$ booleans, with $m > n$.
The transmission of a data word, $B$, by the sender is the assignment of a valid code word, $X$, to the set of wires such that $C(B) = X$. If the assignment also implies that the wires change from a neutral value to a valid value, we can construct a communication protocol in which the receiver can detect that the value read on the wires is the data sent by observing a change from a neutral value to a valid value. Once a valid value has been assigned to the wires, sending the next code word requires either that all wires first be reset to a neutral value or that the coding function, $C$, be changed such that the final, valid, value of any communication can be interpreted as the initial, neutral value of the next communication.
The first solution is a straightforward extension of the four-phase handshake protocol; the second solution is a straightforward extension of the two-phase handshake protocol. Since we usually prefer to use a four-phase protocol, we choose the first solution in this paper. The extended four-phase protocol between the producer and the consumer can be described as follows:
$$\mathit{producer} \equiv *[\,[ci];\ \text{produce } X;\ X\Uparrow;\ [\neg ci];\ X\Downarrow\,]$$
$$\mathit{consumer} \equiv *[\,[n(X)];\ ci\uparrow;\ [v(X)];\ \text{consume } X;\ ci\downarrow\,]$$
Initially, $\neg ci \land n(X)$ holds.
The general notation used is explained in the appendix. $X\Uparrow$ is the concurrent assignment of some bits of $X$ such that the result is a valid value, and $X\Downarrow$ is the concurrent assignment of some bits of $X$ such that the result is a neutral value.
In the consumer, the test $v(X)$ is needed to guarantee that the consumed value is a valid value, and the test $n(X)$ is needed to guarantee that the next valid value produced by the producer is separated from the previous one by a neutral value.
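The handshake can be exercised with a small sequential simulation. The Python sketch below is an illustration only, not the paper's program: it assumes a dual-rail encoding for $X$ (one "true" wire and one "false" wire per data bit, all wires low when neutral), it assigns a whole code word in one step rather than wire by wire, and the names `run_handshake`, `encode`, and `decode` are hypothetical.

```python
def encode(word):
    """Assumed dual-rail code: one (xt, xf) wire pair per data bit."""
    return [(1, 0) if bit else (0, 1) for bit in word]

def decode(x):
    return [xt == 1 for xt, _ in x]

def is_valid(x):
    """v(X): every pair has exactly one wire high."""
    return all(xt != xf for xt, xf in x)

def is_neutral(x):
    """n(X): every wire is low."""
    return all(xt == 0 and xf == 0 for xt, xf in x)

def run_handshake(words):
    """Interleave producer and consumer until every word has been consumed."""
    if not words:
        return []
    width = len(words[0])
    x = [(0, 0)] * width          # data wires, initially neutral
    ci = 0                        # control wire, initially low
    pending = list(words)
    consumed = []
    prod_pc, cons_pc = 0, 0       # position of each process in its loop body
    while len(consumed) < len(words):
        # producer: [ci]; X up; [not ci]; X down
        if prod_pc == 0 and ci == 1 and pending:
            x = encode(pending.pop(0))
            prod_pc = 1
        elif prod_pc == 1 and ci == 0:
            x = [(0, 0)] * width
            prod_pc = 0
        # consumer: [n(X)]; ci up; [v(X)]; consume X; ci down
        elif cons_pc == 0 and is_neutral(x):
            ci = 1
            cons_pc = 1
        elif cons_pc == 1 and is_valid(x):
            consumed.append(decode(x))
            ci = 0
            cons_pc = 0
    return consumed
```

In this sketch the consumer never raises `ci` until the wires are neutral again, which is what keeps two successive valid values apart.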
Intermediate Values
We require that assignments $X\Uparrow$ and $X\Downarrow$ each contain at most one assignment to each boolean variable $x$ of $X$ (an "elementary assignment"). Since any valid value is distinct from any neutral value, the assignment $X\Uparrow$, which realizes the transition from a neutral value $X_n$ to a valid value $X_v$, contains at least one elementary assignment. If $X\Uparrow$ contains more than one elementary assignment, the set of elementary assignments of $X\Uparrow$ can be partitioned into two non-empty subsets, $S1$ and $S2$. The set $S1$ realizes a transition from $X_n$ to a value $Z$ [...]

Observe that a code word, $X$, for which $xt_k \land xf_k$ holds for some value of $k$ is neither valid nor neutral, and is therefore not in the set of code words.
Proof of Property 1. By the definition of $n(X)$ and $v(X)$, we have $n(X) \Rightarrow \neg v(X)$, which establishes Property 1.

Proof of Property 2. Since the code contains only one neutral value, no downward intermediate value is neutral. We prove that an upward intermediate word, $Z$, is not valid. Because of the coding (1), any valid dual-rail code word differs from the neutral word in exactly $N$ bit positions. By definition, $Z$ differs from the neutral value in a number, $m$, of bit positions equal to the size of $S1$. Hence, $m < N$, and $Z$ is not valid.
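The counting argument of the proof of Property 2 can be checked by brute force for small word sizes. The following Python sketch assumes the usual dual-rail code (wire pair `(1, 0)` for true, `(0, 1)` for false, all wires low for the single neutral value) and enumerates every upward intermediate value, i.e., every value reached by a proper, non-empty subset $S1$ of the rising wires. The function names are hypothetical.

```python
from itertools import combinations, product

def encode(word):
    """Flat tuple of wires (xt_0, xf_0, xt_1, xf_1, ...)."""
    wires = []
    for bit in word:
        wires += [1, 0] if bit else [0, 1]
    return tuple(wires)

def is_valid(x):
    """Each (xt, xf) pair has exactly one wire high."""
    return all(x[i] != x[i + 1] for i in range(0, len(x), 2))

def is_neutral(x):
    return not any(x)

def upward_intermediates(target):
    """Values reached from neutral by a proper, non-empty subset of rising wires."""
    rising = [i for i, w in enumerate(target) if w == 1]
    for size in range(1, len(rising)):
        for subset in combinations(rising, size):
            yield tuple(1 if i in subset else 0 for i in range(len(target)))

def check_separable(n_bits):
    """True if no upward intermediate value is valid or neutral."""
    for word in product([False, True], repeat=n_bits):
        for z in upward_intermediates(encode(word)):
            if is_valid(z) or is_neutral(z):
                return False
    return True
```

The downward transition passes through the same set of values in the opposite direction, so the same enumeration covers both cases under these assumptions.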
One-Hot Code
Another commonly used delay-insensitive code is the so-called one-hot code. For a data word $B$ of $n$ bits, the one-hot code $X$ is the word of $2^n$ bits with exactly one bit true, in the position corresponding to the decimal value of $B$. We have:

Since $X\Uparrow$ and $X\Downarrow$ both contain exactly one elementary assignment, no intermediate value can be generated; thus, the one-hot code is separable.
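As a small illustration of why the one-hot code has no intermediate values, the Python sketch below encodes an $n$-bit word on $2^n$ wires and counts the elementary assignments needed to go from the neutral value (assumed here to be all wires low) to the valid value. The bit order (most significant bit first) and the function names are assumptions made for the example.

```python
def one_hot(word_bits):
    """Encode a non-empty list of booleans (MSB first) on 2**n wires."""
    n = len(word_bits)
    value = int("".join("1" if b else "0" for b in word_bits), 2)
    return tuple(1 if i == value else 0 for i in range(2 ** n))

def elementary_assignments(old, new):
    """Number of wires that change between two code values."""
    return sum(1 for o, w in zip(old, new) if o != w)
```

With a single wire changing per transition, there is no proper non-empty subset of elementary assignments, hence no intermediate value.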
Function Evaluation
We want to construct a process, $F$, that repeatedly takes a separable code word, $X$, and produces a separable code word, $Y$, such that $Y = f(X)$ for a given function $f$. The process behaves as both the consumer of argument $X$ and the producer of the result $Y$. Combining the two protocols gives the function-evaluation process [...]

However, the one-hot code is not distributive.

The Main Theorem

Next, we show how to implement the function evaluation process, $F$, with a set of concurrent cells, each dedicated to assigning one bit of the function. We present the result in the form of a theorem. Although the method is applicable to all distributive codes, we prove the theorem for dual-rail codes.
We first distribute the dual-rail input code word $X$ in the following way: We construct a set of $N$ subcodes, $W_k$, with $0 \le k < N$, where $N$ is the size (number of bits) of the data output. The construction of the code follows the two rules introduced in the proof of the previous theorem. Hence, we have

$$(\forall k :: v(W_k)) \equiv v(X) \qquad (6)$$
$$(\forall k :: n(W_k)) \equiv n(X) \qquad (7)$$

We add one extra requirement: Let $S_k$ be the set of bits of $X$ used in $Bt_k$ and $Bf_k$. We require that $W_k$ be chosen such that $S_k \subseteq W_k$.
Hence, the function evaluation can be distributed only if the algorithm used for evaluating the function satisfies the locality property that the number of bits of $X$ used in $Bt_k$ and $Bf_k$ is significantly smaller than $N$. This extra requirement ensures that the validity of $W_k$ implies the validity of the bits of $X$ used in $Bt_k$ and $Bf_k$.
With this distribution of the input code, $X$, we will establish [...]

Proof. We are going to produce the solution by successive program transformations.
The function evaluation process, $F$, and the environment, $E$, share variables in a restricted form. Process $F$ sets the output variables, $Y$, and observes the input variables, $X$. Process $E$ sets the input variables, $X$, and observes the output variables, $Y$. The correctness of any implementation relies on an important property of the guard evaluations, called stability.
Definition 2 (Stability). Let $G$ be a guard containing shared variables assigned by another process. The evaluation of $G$ is stable if, once $G$ is evaluated to true, it remains true at least until the process containing $G$ changes some variable.
Theorem 3. All guards are stable in the initial version of $F$ and in $E$.
The proof is immediate from the properties of a separable code.
We shall maintain the stability of the guards as an invariant of all further versions of $F$.
We can now introduce and justify the successive transformations of $F$. In the proof, the range of $k$ is from $0$ to $N-1$ and is omitted.

The net effect of the assignment, $Y_j\Uparrow$, is not changed either. This assignment depends only on the validity of the variables in the set, $W_j$.
Since all bits of $X$ are assigned concurrently by the environment, if $v(X_k)$ is true in a state of $F$, we can conclude that eventually $v(X_j)$ will hold, and similarly for the downgoing transitions. Hence, the assignment, $Y_j\Uparrow$, will be correctly executed in the new program.
The other half of the transformation is justified in the same way. Now, $F$ has the structure:

However, $T'_j$ is conditional on $n(W_j)$ holding. And the environment establishes $n(W_j)$ as a result of $X\Downarrow$, which is conditional on $v(Y)$ holding as a postcondition of the preceding $T_k$ for all $k$. Hence, the sequencing between a $T_k$ action and the following $T$ [...]

The other half of the proof is similar. We have also established that the sequential composition between $T_k$ and $T'_k$ inside the same cell (i.e., for the same value of $k$) is also superfluous, which justifies the next transformation. We eliminate the last semicolon by moving the test, $v(W_k)$, inside the guard of the selection command. We get:
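$$*[[\,Bt_k \land v(W_k) \rightarrow yt_k\uparrow \;[]\; Bf_k \land v(W_k) \rightarrow yf_k\uparrow\,]] \;\parallel\; *[[\,n(W_k) \rightarrow yt_k\downarrow \parallel yf_k\downarrow\,]].$$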
This transformation is valid if we assume that the implementation of the guard evaluation uses the same value of $X$ for both $v(W_k)$ and $Bt_k$ in the first guard, and the same value of $X$ for both $v(W_k)$ and $Bf_k$ in the second guard. This requirement is relatively easy to meet in VLSI, but we will not elaborate any further, as we can justify the transformation in another way, thanks to a property of $Bt_k$ and $Bf_k$ that we will introduce for optimization purposes.

Proof. It is obvious that $A$ and $B$ being mutually exclusive is a necessary condition for the equivalence of the two programs.
Assume that $A$ and $B$ are mutually exclusive. Any finite execution of either program is an interleaving of a finite number of executions of $A$ and $B$. An execution of $A$ or $B$ is a "step" of the interleaving.
Assume that the two interleavings are identical up to and excluding the $n$-th step, $n \ge 0$. Since the selection command is deterministic, the $n$-th step is unique, and is therefore identical for both interleavings.
This completes the proof of the main theorem.
Corollary 1 All guards of a cell are stable.
Binary Addition
As an example of an application of the method, we will now implement the process, $F$, whose function, $f$, is the addition of two $N$-bit integers, $A$ and $B$. The output is an $(N+1)$-bit integer, $S$. We want to select an algorithm for binary addition in which the functions, $Bt$ and $Bf$, as introduced in the previous sections, depend only on a few bits of $A$ and $B$. "Ripple-carry addition" is such an algorithm.
Ripple-Carry Addition
The value of bit $s_k$ of $S$ can be expressed as a function of bits $a_k$ and $b_k$ of $A$ and $B$, and of the carry-in bit, $c_k$. More precisely, the postcondition of the addition can be expressed as:
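As an illustration, the usual ripple-carry recurrence can be written out in Python. This is the textbook formulation (sum bit as the parity of $a_k$, $b_k$, $c_k$; carry-out as their majority), given here as an assumption rather than as the paper's exact postcondition. The helper names and the little-endian bit order are choices made for the example.

```python
def ripple_carry_add(a_bits, b_bits, carry_in=False):
    """Add two little-endian bit lists of equal length N; returns N+1 sum bits."""
    assert len(a_bits) == len(b_bits)
    s_bits = []
    c = carry_in
    for a, b in zip(a_bits, b_bits):
        s_bits.append(a ^ b ^ c)               # s_k = a_k xor b_k xor c_k
        c = (a and b) or ((a or b) and c)      # c_{k+1} = majority(a_k, b_k, c_k)
    s_bits.append(c)                           # the last carry is bit N of S
    return s_bits

def to_bits(value, n):
    """Little-endian list of the n low-order bits of value."""
    return [bool((value >> k) & 1) for k in range(n)]

def from_bits(bits):
    return sum(1 << k for k, bit in enumerate(bits) if bit)
```

Each iteration uses only $a_k$, $b_k$, and $c_k$, which is the locality that makes the algorithm a candidate for distribution over cells.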
"Magic" Inputs
First, let us assume that the carry-in bits are provided "by magic" by the environment as normal inputs, and that each cell computes its carry-out, $d_k$, as a normal output. We can then apply our main theorem and construct an adder as the concurrent composition of $N$ adder-cells.
The inputs, $A$, $B$, and $C$, and the outputs, $S$ and $D$, are dual-rail encoded: To bit $a$ of data input $A$ correspond bits $at$ and $af$ of the dual-rail code, and similarly for the other inputs and outputs. For the construction of a generic adder-cell, add, we can omit the subscript $k$. The guards, $Bt$ and $Bf$, of the commands that set the two output bits to true in the main theorem have to be replaced with two sets of guards, as we have two different output bits per cell.
Guards $St$ and $Sf$ are used to assign bits $st$ and $sf$, respectively. Guards $Dt$ and $Df$ are used to assign bits $dt$ and $df$, respectively.
We have:

For dual-rail codes, this expression can be simplified as $v(x) = xt \lor xf$, since $xt \land xf$ never holds.
Eliminating the Magic
Since all input transitions are delay-insensitive, we can restrict the "magic" to producing a valid input, $c_{k+1}$, only after output $d_k$ is valid, and to producing a neutral input, $c_{k+1}$, only after output $d_k$ is neutral, for $0 \le k < N-1$. The environment originally produces input, $c_0$, which is false. The solution is still correct, although the concurrency between cells has been restricted.
Next, we observe that since, for $k < N-1$, the valid value of $d_k$ is the same as the valid value of $c_{k+1}$, we can eliminate the magic and [...]

The program of a cell is:

The solution obtained is completely sequential since the validity of $s_k$ depends on the validity of the carry-in, $c_k$. The solution can be greatly improved by reducing these dependencies and simplifying the guards.
Optimization
An important property of the dual-rail code is that the tests $v(W_k)$ can be simplified and often even eliminated. Simplifying or eliminating these tests may eliminate some of the sequential dependencies between the validity of an input and the validity of an output, hence reducing the number of steps required to compute the function in the average case.
We will also simplify the remaining expressions. These transformations will reduce the number of conjuncts in boolean expressions, hence reducing the number of transistors in series in a pullup or pulldown chain of a CMOS implementation. In the worst case, the switching delay is quadratic in the number of transistors in series.
Simplifying the Validity Conditions
The validity tests, $v(W_k)$, can be simplified by application of [...]

In other words, if a guard $Bt$ or $Bf$ of a cell is true, the inputs used to establish the truth of the guard are valid, and thus the guard is stable. As was suggested earlier, Transformation 6 can be justified by means of this property of dual-rail codes.
Validity of Transient Inputs
Although all guards $Bt$ or $Bf$ of a cell are stable, we cannot always eliminate the validity tests altogether, because of the possible existence of so-called transient inputs.
It may occur that, for some value of the inputs, some input bits are not used to establish the validity of the output $Y$, and therefore the function-evaluation process can complete the handshake protocol without waiting for these input bits to be valid. However, we have to see to it that those input bits still go through the valid-neutral cycle before they are used in a subsequent function evaluation. Such input bits are called transient inputs. Let us look at a simple example.
The function to be implemented is the AND-function:
The dual-rail translation of this program gives:
We observe that because of the disjunction in the second guard, both $a$ and $b$ are transient inputs. Hence, the second guard has to include the validity test for the transient inputs.
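The effect of a transient input can be seen on a small Python model. The guards below are an assumed dual-rail translation of AND (output true when both inputs are true, output false when at least one input is false); they are a sketch for illustration, not a transcription of the paper's program, and the function names are hypothetical.

```python
def and_guards(at, af, bt, bf):
    """Assumed dual-rail AND guards, without validity tests: (yt guard, yf guard)."""
    yt = at and bt            # output true needs both inputs true
    yf = af or bf             # output false needs only one false input
    return yt, yf

def and_guards_checked(at, af, bt, bf):
    """Same guards, with the validity test added to the second one."""
    valid = (at or af) and (bt or bf)
    return at and bt, (af or bf) and valid

# a is false, b is still neutral
state = dict(at=False, af=True, bt=False, bf=False)
print(and_guards(**state))          # second guard already holds
print(and_guards_checked(**state))  # second guard waits for b to be valid
```

Without the validity test, the output can become valid, and the handshake can complete, while `b` has not yet gone through its valid-neutral cycle.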
Simplification of the Adder
We first eliminate the validity tests from the guards. We then simplify $Dt$ and $Df$. Finally, we check that there is no transient input in the new guards. We leave it to the reader to verify that $Dt \equiv at \land bt \lor \mathit{dif}(a,b) \land ct$ can be simplified as:

$$at \land bt \lor (at \lor bt) \land ct.$$

$Df$ can be simplified similarly as $af \land bf \lor (af \lor bf) \land cf$.
We cannot simplify the expressions for $St$ and $Sf$. With this new set of guards, we check that all inputs are used in $St$ and $Sf$, i.e., $st \lor sf \Rightarrow v(a) \land v(b) \land v(c)$ holds, and thus there is no transient input.
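The checks described above can be carried out exhaustively on a Python model of one cell. The sketch rests on several assumptions that are not taken from the text: $St$ and $Sf$ are written as plain sums of products over one wire of each input, $\mathit{dif}(a,b)$ is taken to mean $at \land bf \lor af \land bt$, and each dual-rail input is modeled as an `(xt, xf)` pair.

```python
from itertools import product

NEUTRAL, TRUE, FALSE = (0, 0), (1, 0), (0, 1)    # (xt, xf) pairs

def guards(a, b, c):
    """Assumed guards of one adder cell: (St, Sf, Dt, Df)."""
    (at, af), (bt, bf), (ct, cf) = a, b, c
    St = (at and bf and cf) or (af and bt and cf) or (af and bf and ct) or (at and bt and ct)
    Sf = (af and bt and ct) or (at and bf and ct) or (at and bt and cf) or (af and bf and cf)
    Dt = (at and bt) or ((at or bt) and ct)      # simplified carry-out true
    Df = (af and bf) or ((af or bf) and cf)      # simplified carry-out false
    return bool(St), bool(Sf), bool(Dt), bool(Df)

def dt_unsimplified(a, b, c):
    (at, af), (bt, bf), (ct, cf) = a, b, c
    dif = (at and bf) or (af and bt)             # assumed meaning of dif(a, b)
    return bool((at and bt) or (dif and ct))

def no_transient_input():
    """St or Sf holding implies that a, b, and c are all valid."""
    for a, b, c in product([NEUTRAL, TRUE, FALSE], repeat=3):
        St, Sf, _, _ = guards(a, b, c)
        if (St or Sf) and NEUTRAL in (a, b, c):
            return False
    return True
```

Under these assumptions the simplified $Dt$ agrees with the unsimplified form whenever $a$ and $b$ are valid, but it can also hold while one of them is still neutral, which is why the check on $St$ and $Sf$ is needed.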
A Graphical Analysis
A graphical analysis can be helpful in identifying the transient inputs and at the same time in evaluating the efficiency of the algorithm. In the case of the adder, we construct the following graph: To each cell correspond four nodes in the graph: one for input $c$, one for inputs $a$ and $b$ together, one for output $s$, and one for output $d$. Figure 1 shows the dependency graph for three cells.
The graph shows that the validity of $a$, $b$, and $c$ is required for $s$ and $d$ to be valid. Each directed path from an input to an output indicates that the validity of each node but the last one on the path is required for the next node on the path to be valid. Hence, the length of the longest path gives an upper bound on the number of steps necessary to compute the outputs.
An inspection of the graph shows that the longest path is proportional to the largest number of contiguous cells with the arrow from $c$ to $d$ (the dotted arrow) present. Hence: The number of steps required to compute the output of the ripple-carry adder is proportional to the maximal number of contiguous binary positions in which one input bit is different from the other.

The assignments of $Y\Downarrow$ are unconditional: All variables of $Y$ are reset to the neutral value. We can therefore distribute the test $n(X)$ in any way we want. In the case of the adder, we can split the test $n(W)$ into $n(a) \land n(b)$ on the one hand, and $n(c)$ on the other hand. We can associate either guard with the transitions $st\downarrow$, $sf\downarrow$ or $dt\downarrow$, $df\downarrow$. The two choices are expressed in the dependency graphs of Figure 2, in which an arrow from $x$ to $y$ means that the neutrality of $y$ depends on the neutrality of $x$. It is clear that the solution of Figure 2a is more efficient since all paths have constant length. This choice corresponds to the guarded commands:

Hence, the whole adder can be implemented directly in CMOS without further transformation into "standard cells."
To the expression, $B$, corresponds a series-parallel switching network, $N_B$. Each switch is implemented with an n-transistor or a p-transistor whose gate is a literal of $B$. Hence, the predicate "there is a conducting path between the two terminal nodes of $N_B$" has the same value as $B$. We limit ourselves to two types of switching networks: A "pullup" circuit has for terminal nodes the high-voltage constant, VDD, and the output node, $x$, of the program. A "pulldown" circuit has for terminal nodes the low-voltage constant, GND, and the output node, $x$, of the program. Hence, a pullup circuit implements the program $B \rightarrow x\uparrow$, and a pulldown circuit implements the program $B \rightarrow x\downarrow$. For reasons of efficiency particular to the CMOS technology, we restrict a pullup circuit to containing only p-transistors, and a pulldown circuit to containing only n-transistors. A p-transistor is a conducting switch when the gate voltage is low; an n-transistor is a conducting switch when the gate voltage is high.
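The correspondence between a guard and its switching network can be modeled in a few lines of Python. In this sketch a network is either a wire name (one transistor gated by that wire) or a tuple `("series", ...)` / `("parallel", ...)`; this representation and the function name `conducts` are assumptions made for the illustration.

```python
from itertools import product

def conducts(network, gates, kind):
    """Is there a conducting path through a series-parallel network?

    network: a wire name, ("series", n1, n2, ...) or ("parallel", n1, n2, ...)
    gates:   dict mapping wire names to True (high) or False (low)
    kind:    "n" (switch conducts when its gate is high) or "p" (when low)
    """
    if isinstance(network, str):
        high = gates[network]
        return high if kind == "n" else not high
    op, *parts = network
    results = [conducts(part, gates, kind) for part in parts]
    return all(results) if op == "series" else any(results)

# Pulldown network for the guard af.bf + (af + bf).cf, built from n-transistors.
df_network = ("parallel",
              ("series", "af", "bf"),
              ("series", ("parallel", "af", "bf"), "cf"))

for af, bf, cf in product([False, True], repeat=3):
    guard = (af and bf) or ((af or bf) and cf)
    assert conducts(df_network, {"af": af, "bf": bf, "cf": cf}, "n") == guard
```

With `kind="p"` the same network conducts when its gate wires are low, which corresponds to guards written with inverted literals only.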
Hence, we can choose to implement the first four guards of the adder-cell as pulldown circuits since they do not have inverted literals, and the last two guards of the adder-cell as pullup circuits since they have only inverted literals; but then, all outputs of the cell are inverted.
Adding an inverter to each output is expensive since the carry chain may include up to $N$ inverters in series in addition to the $N$ carry gates. A better solution is obtained by alternating cells that produce negated outputs (the even-numbered bits) with cells that produce straight outputs (the odd-numbered bits).
A CMOS implementation of a cell with inverted outputs is shown in Figure 3. The only noticeable disadvantage of this design is the long pullup chain (4 transistors) for the carry circuitry. We can reduce the length of these pullup chains from 4 to 3 by distributing the neutrality test even more evenly. For instance, we can choose the following distribution:

The transistor count per cell is 34. If one includes the inverters needed to invert the inputs and the outputs of every other cell, the transistor count is 42, as compared to the 40 transistors needed for an equivalent (no pass-transistors) cell design in clocked logic. Hence, contrary to common belief, the asynchronous solution is hardly larger than the clocked one, in spite of the use of dual-rail logic.
In evaluating the performance of the adder, it is important to realize that only the transitions from neutral to valid values are critical in the type of protocol (lazy-active) used. From equation (3) describing the environment protocol, we see that the environment consumes the result, $Y$, and produces the next output $X$ before testing that $Y$ has been reset to the neutral value by the function-evaluation process. Hence, the resetting of $Y$ to the neutral value is not on the critical path.
As we have seen, the length of the longest carry chain is proportional to the maximal number, $n$, of contiguous binary positions in which one input bit is different from the other. In the HP CMOS 40 process provided by MOSIS (1.6-micron feature size), the delay in nanoseconds for an addition is $6 + 1.2(n - 1)$. This delay includes the completion-tree delay required for the environment to detect the completion of an addition. It is usually believed that, statistically, $n$ is about $\log N$. Hence, for $N = 32$, an adder delay is about 11 nanoseconds in the average case.
If we had to adjust the delay to the worst case, as is required in clocked logic, we would have to stretch each addition delay to accommodate the delay corresponding to $N = 32$, i.e., 40 nanoseconds, or four times the average delay!
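The quantities in this estimate are easy to reproduce. The Python sketch below computes $n$ for a pair of operands as the longest run of contiguous positions in which the two inputs differ, applies the delay formula quoted above, and estimates the average of $n$ over random operands. The function names, the sample size, and the use of uniformly random operands are assumptions of the example, not measurements from the paper.

```python
import random

def longest_differing_run(a, b, n_bits):
    """Largest number of contiguous bit positions in which a and b differ."""
    diff = a ^ b
    best = run = 0
    for k in range(n_bits):
        if (diff >> k) & 1:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def adder_delay_ns(n):
    """Delay formula quoted in the text: 6 + 1.2 (n - 1) nanoseconds."""
    return 6 + 1.2 * (n - 1)

def average_run(n_bits, samples=10000, seed=0):
    """Monte Carlo estimate of the mean of n for uniformly random operands."""
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        a, b = rng.getrandbits(n_bits), rng.getrandbits(n_bits)
        total += longest_differing_run(a, b, n_bits)
    return total / samples
```

For $n = 5$, that is $\log_2 32$, the formula gives $6 + 1.2 \times 4 = 10.8$ nanoseconds, consistent with the figure of about 11 nanoseconds.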
Comparison to the similar adder designed by C.L. Seitz in [7] seems unavoidable. Seitz's adder cell contains more than 100 transistors, without counting the inverters. Hence, it is about three times larger, and also three times slower, than the adder cell presented here.
Conclusion
We have presented a method for the formal derivation of asynchronous datapath functions. First, an algorithm with reasonable distributive properties has to be chosen for the function evaluation and, for that matter, ripple-carry is not the only choice for the adder. After that choice has been made, the rest of the derivation is almost automatic. Apart from some simplification of the guards, which can be important, the main decision left to the designer is how to distribute the validity test for the transient inputs, if any, and the neutrality test.
In the method presented, the validity and neutrality tests are included in the evaluation of the function output variables. Another, quite different, approach is to keep the function evaluation proper separate from the validity and neutrality tests, and to perform them concurrently.
For the method used, dual-rail coding is almost ideal because of its distributivity property. Other codes may be better suited for the alternative method mentioned. The adder described here has been used in a slightly different form (the inputs $A$ and $B$ are not dual-rail encoded, as they are part of the same process as the adder) as a basis for the different asynchronous arithmetic units in the Caltech Asynchronous Microprocessor [2]. The performance of the ALUs in general has been surprisingly good [3].
$[G]$, where $G$ is a boolean expression, stands for $[G \rightarrow \text{skip}]$, and thus for "wait until $G$ holds." Hence, "$[G]; S$" and $[G \rightarrow S]$ are equivalent.
$*[S]$ stands for "repeat $S$ forever." Hence, the operational description of the statement $*[[G_1 \rightarrow S_1 \;[]\; \ldots \;[]\; G_n \rightarrow S_n]]$ is "repeat forever: wait until some $G_i$ holds; execute the $S_i$ for which $G_i$ holds."
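The operational reading of the repeated selection can be mimicked in Python. The sketch below runs a single iteration of such a loop over a state dictionary; the representation of guarded commands as pairs of functions, the name `step`, and the toy cell are assumptions made for illustration, and the environment is played by the caller changing the state between steps.

```python
def step(commands, state):
    """One iteration of a repeated selection over (guard, action) pairs.

    Returns True if some guard held and its action ran, and False if the
    process would still be waiting. The selection is assumed deterministic,
    so at most one guard holds at a time.
    """
    for guard, action in commands:
        if guard(state):
            action(state)
            return True
    return False

# A toy cell: raise y when x is true, lower y when x is false.
cell = [
    (lambda s: s["x"] and not s["y"], lambda s: s.update(y=True)),
    (lambda s: not s["x"] and s["y"], lambda s: s.update(y=False)),
]

state = {"x": True, "y": False}
step(cell, state)      # first guard holds, y is raised
step(cell, state)      # no guard holds, the cell waits
```

A call that returns `False` corresponds to the "wait until some $G_i$ holds" part of the description.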
