A new method for designing asynchronous circuits is described. It utilises additional circuitry to monitor the activity of internal nodes. When all transitions have halted, a completion signal is generated. Details of the circuit and design methodology are given. The proposed approach yields faster operation than equivalent synchronous circuits while incurring only a small circuit overhead.
Activity monitors (AM's) are applied to internal signals of the CL. An AM has one input, which is connected to the signal to be observed, and one open collector (OC) output. The function of an AM is to generate a pulse, i.e. a high-low-high transition, on its OC output whenever a transition of the input signal occurs, regardless of whether that transition is high-low or low-high. If no transition of the input signal occurs, the OC output remains in a high impedance state.
An AM can be attached to every internal signal of the CL. The AM outputs have to be connected to a common signal ACT (Fig. 1), which is fitted with a pull-up resistor and can be used to determine whether the CL is in a transient or steady state.
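The wired behaviour of ACT can be illustrated with a small behavioural model. The sketch below is not the transistor-level circuit of Fig. 1: it assumes ideal pulses of a fixed width, ignores the switch-on delay, and uses hypothetical function names (`act_low_intervals`, `act_is_high`). Each AM pulls ACT low for one pulse width after a transition on its input, and the pull-up resistor returns ACT to high once no AM is pulling low.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def act_low_intervals(transition_times: List[float], pulse_width: float) -> List[Interval]:
    """Return the merged time intervals during which ACT is pulled low.

    Assumption: every monitored transition produces one ideal low pulse of
    length pulse_width on the shared open-collector line.
    """
    pulses = sorted((t, t + pulse_width) for t in transition_times)
    merged: List[Interval] = []
    for start, end in pulses:
        if merged and start <= merged[-1][1]:
            # Overlapping pulses from different AMs merge into one low phase.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def act_is_high(t: float, low_intervals: List[Interval]) -> bool:
    """ACT is high (steady state) whenever no AM is pulling the line low."""
    return not any(start <= t < end for start, end in low_intervals)


# Illustrative transition times in ns (made-up values, not from the paper).
low = act_low_intervals([0.0, 1.0, 5.0], pulse_width=2.0)
print(low)                    # [(0.0, 3.0), (5.0, 7.0)]
print(act_is_high(4.0, low))  # True: a gap between pulses shows up as a spike on ACT
```

In this made-up example the third transition arrives after the earlier pulses have expired, so ACT briefly returns high before the logic has settled. This is the kind of spike that sufficiently long, overlapping pulses are meant to prevent.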
The transistor level implementation of an AM as drawn in Fig. 1 has been found to be an optimal solution with respect to symmetrical operation, speed and area required. INV1 and INV2 have to be weak inverters to guarantee a sufficient duration of output pulses. The repercussion on the signal to be monitored, imposed by additional capacitance, is kept to a minimum since only three additional transistors have to be driven. A total of eight transistors are needed for one AM.
Attaching one AM to every internal signal of a CL can result in a substantial increase in circuit area.
Alternatively, AM's can be attached to certain exposed signals only. To guarantee reliable operation, the following conditions must be fulfilled.
1. To attach AM's the CL has to be subdivided into non-overlapping subcircuits. Each signal which connects two different subcircuits has to be fitted with an AM. The group of all AM's which are attached to the outputs of one subcircuit is called a vector of AM's.
2. In order to ensure the safe overlapping of output pulses of AM's, and hence avoid spikes on the common output signal ACT, the pulse width (t_p) generated by each vector of AM's when detecting a signal transition must exceed the delay of the critical path (t_crit) of the subsequent subcircuit plus the switch-on delay (t_A) of the subsequent vector of AM's.
3. To cover the cases when no transitions occur at all (state similarity) or when transitions occur only in the first subcircuit connected to the primary inputs, a minimum delay generator (MDG) is needed [3]. The MDG is activated when a new data token is applied. The pulse width generated by an MDG (t_MDG) must exceed the delay of the critical path of the first subcircuit plus the switch-on delay (t_A) of the first vector of AM's.
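Conditions 2 and 3 amount to a set of inequalities that can be checked mechanically once the delays of a partitioned circuit are known. The following sketch assumes a simple linear chain of subcircuits in which subcircuit i is followed by AM vector i. The function name `check_amcd_timing` and the list-based data layout are hypothetical and are not part of the described method.

```python
from typing import List


def check_amcd_timing(t_crit: List[float], t_p: List[float],
                      t_a: List[float], t_mdg: float) -> List[str]:
    """Return a list of violated timing conditions (empty if all hold).

    Assumed layout: t_crit[i] is the critical-path delay of subcircuit i,
    while t_p[i] and t_a[i] are the pulse width and switch-on delay of the
    AM vector attached to the outputs of subcircuit i.
    """
    violations = []

    # Condition 3: the MDG pulse must outlast the first subcircuit plus the
    # switch-on delay of the first AM vector.
    if not t_mdg > t_crit[0] + t_a[0]:
        violations.append("MDG: t_MDG must exceed t_crit[0] + t_A[0]")

    # Condition 2: each vector's pulse must outlast the subsequent subcircuit
    # plus the switch-on delay of the subsequent AM vector.
    for i in range(len(t_crit) - 1):
        if not t_p[i] > t_crit[i + 1] + t_a[i + 1]:
            violations.append(
                f"vector {i}: t_p must exceed t_crit[{i + 1}] + t_A[{i + 1}]"
            )
    return violations


# Example with illustrative delays in ns for a chain of three subcircuits.
print(check_amcd_timing([1.4, 1.4, 1.4], [4.6, 4.6, 4.6], [0.9, 0.9, 0.9], 4.6))
```

A strict inequality is used because the conditions require the pulse width to exceed the sum of the two delays. Any safety margin would be applied by the designer before calling such a check.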
For a ripple-carry adder an appropriate place to attach AM's is to each carry signal (C1, C2, ...) as shown in Fig. 1. Here the vector of AM's degenerates to a single AM. Each one-bit full adder represents a subcircuit. The pulse width generated by each AM (t_p) when detecting a signal transition must cover the critical path of a one-bit full adder plus the switch-on delay of the subsequent AM. In practice, as with synchronous circuits, a safety margin of up to 100% may need to be added to the nominal value of t_p. The signal A0 shown in Fig. 2 acts as an example for the applied input vector. The measured switch-on delay for the signal ACT is t_A = 0.9 ns and the delay of the critical path of a subcircuit is t_crit = 1.4 ns. The rising edge of signal ACT can be used to trigger the transport of a processed data token to subsequent processing stages.
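As a worked illustration of condition 2 with the measured adder values, the lower bound on the pulse width can be written out as follows. The pulse width actually used in the design is not stated here, and applying the full 100% margin to the sum of both delays is an assumption made only for this example.

```latex
\begin{align*}
t_p &> t_{crit} + t_A = 1.4\,\text{ns} + 0.9\,\text{ns} = 2.3\,\text{ns}
      && \text{(nominal lower bound)} \\
t_p &\approx 2 \times 2.3\,\text{ns} = 4.6\,\text{ns}
      && \text{(with an assumed 100\% safety margin)}
\end{align*}
```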
RESULTS
In order to study the approach in more detail, a transistor level implementation of a 4 by 4-bit parallel multiplier has been used. The multiplier comprises a regular array of cells which compute the product of appropriate inputs (P) and a number of adders (A). To investigate the effects of parasitic capacitance, including estimates of wire capacitances, an analogue simulation of this circuit has been carried out, again using the SPICE parameters of a 1.2 µm CMOS process. The circuit has been divided into a number of subcircuits by inserting vectors (columns) of AM's as shown in Fig. 3. Table 1 compares the performance of the 4 by 4-bit multiplier using AMCD to that of an identical synchronous circuit. A 100% safety margin has been allocated to both the synchronous circuit and the subcircuits using AMCD. The average speed of the AMCD version is 37% higher than that of its synchronous counterpart, and its worst case speed is still 9% higher. The additional area required for the AM's and their interconnection is relatively small at 12%. This additional hardware causes an almost proportional increase in energy consumption (E) of 11%. At the system level, the increase in energy consumption will be more than compensated for by the abolition of the clock signal.
CONCLUSIONS
The feasibility of a new method called AMCD to achieve completion-detection for self-timed circuits has been demonstrated. The major advantage of the proposed method is that data dependent delays of CL's can be exploited whilst imposing only marginal repercussions on the CL itself. An analogue simulation of a 4 by 4-bit multiplier using models of a commercially available CMOS process has demonstrated the practicality of the approach. The results obtained clearly demonstrate the benefits of the method.
