One of the most frequently used primitives in asynchronous control circuits is the C-element. The three most popular CMOS implementations of the C-element are compared with respect to energy efficiency, delay, and area, with an emphasis on energy. The three implementations were introduced by Sutherland, Martin, and Van Berkel. We show that in a typical environment, Van Berkel's implementation is superior to the other implementations with respect to energy consumption and area, for the same delay.
Introduction
Various applications have demonstrated that asynchronous circuits have great potential for low-power and high-performance design. One of the most frequently used primitives in asynchronous control circuits is the C-element. The purpose of the paper is to compare the three most popular CMOS implementations of the C-element. One implementation of the C-element has been introduced by I.E. Sutherland [1] and is used often in high-performance micropipelines; a second implementation has been introduced by A.J. Martin and has been used in the Caltech asynchronous microprocessor [2] and an asynchronous, low-power version of the ARM developed at Manchester University [3]; a third implementation has been introduced by K. Van Berkel. The different implementations are examined in two different setups. For each setup, we measure the delay and energy dissipation through HSPICE simulations.
The C-element
The C-element has been introduced by D. E. Muller [6] and is therefore also called the "Muller C-element." A C-element has two inputs a and b and one output c. Traditionally, its logical behaviour has been described as follows. If both inputs are 0 (1) then the output becomes 0 (1); otherwise the output remains the same. For the proper operation of the C-element, it is also assumed that once both inputs become 0 (1), they will not change until the output changes. A state diagram is given in 
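The logical behaviour described above can be sketched as a small behavioural model. This is only an illustration: the function name `c_element_next` is hypothetical, and the majority-function form c' = ab + c(a + b) is the standard Boolean equivalent of the verbal description rather than a formula taken from this paper.

```python
def c_element_next(a: int, b: int, c: int) -> int:
    """Next output of a two-input C-element with current output c.

    If both inputs agree, the output takes their value; otherwise it holds.
    This equals the majority function of (a, b, c).
    """
    return (a & b) | (c & (a | b))


if __name__ == "__main__":
    # Print the full next-state table for inputs a, b and current output c.
    for c in (0, 1):
        for a in (0, 1):
            for b in (0, 1):
                print(f"a={a} b={b} c={c} -> c'={c_element_next(a, b, c)}")
```

The model captures only the logic function; it says nothing about the delay or energy of any particular transistor-level realization.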
Measurement Setup
Two testing environments are considered: one to evaluate the effect of the input capacitance and driving ability of an individual C-element gate, and another one to evaluate the performance of a series of cross-coupled C-element gates, where the performance of the individual elements is mutually dependent. 
D = (2)
The energy consumption of the C-element per output transition also depends on the arrival times and order of the inputs. We measure the average energy consumption per output cycle E, corresponding to the illustrated waveforms, from the following.
T 3
Figure 3: Second measurement setup: testing for optimal sizing of a chain structure in presence of feed-back
and T is the total period of the output waveform. The HSPICE simulations for this setup were performed at a frequency of 40 MHz.
The results obtained through the first setup are useful when we want to insert a C-element in a spot where the input drive and output capacitance are known. The second setup, however, can give suggestions for the implementation of a subsystem in which the C-elements are mutually dependent. The technology file of a 0.8 µm BiCMOS process has been used in our HSPICE simulations. In this technology, the threshold voltage of NMOS devices is 0.81 V and that of PMOS devices is -0.90 V. A power supply voltage of 3 V is assumed in all simulations.
The second measurement setup is shown in Figure 3. The C-elements in the chain are all of the same size. This setup differs from the first setup in that the performance of the C-elements is now mutually dependent, which affects the overall performance of the chain. A bubble at the input of a C-element schematic means that an inverted version of that input must be used. This, of course, can be implemented using an inverter.
The chain of C-elements shown, without the inverters at the two ends, forms the control circuit of an n-stage micropipeline. The inverters are added to make the micropipeline self-driven and oscillating. The signals at the nodes indicated by r(i), where i represents the stage number, can be interpreted as "request for the next stage". Initially, all nodes are set to low, and the only possible event is the rising of r(0). This logic "one" then propagates through all the request nodes. Meanwhile, other transitions are produced at r(0), which in turn propagate toward the end. If not interfered with, the oscillation of the nodes continues forever. A much simplified waveform is shown in the same figure. The parameters of interest in this test are the latency L, throughput, and energy per throughput. The frequency of oscillation of the micropipeline F, which is half its throughput, is the inverse of a. The energy dissipation thus, is
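The self-driven oscillation described above can be illustrated with a unit-delay logic simulation. This is a rough sketch under stated assumptions, not the HSPICE setup of Figure 3: the wiring is assumed (the left inverter drives r(0) from r(1), stage i combines r(i-1) with the inverted r(i+1), and the right inverter feeds the inverted r(n) back to the last stage), every gate is given the same unit delay, and the names `c_element` and `simulate_chain` are hypothetical.

```python
def c_element(a: int, b: int, c: int) -> int:
    """Majority-function model of a C-element with current output c."""
    return (a & b) | (c & (a | b))


def simulate_chain(n_stages: int, n_steps: int):
    """Unit-delay simulation of an assumed self-oscillating C-element chain.

    r[0] is the output of the left-end inverter, r[1..n] are the
    C-element outputs, and x is the output of the right-end inverter.
    Returns the list of r-node states, one tuple per time step.
    """
    r = [0] * (n_stages + 1)   # all request nodes start low
    x = 1                      # right inverter output, consistent with r[n] = 0
    trace = [tuple(r)]
    for _ in range(n_steps):
        nxt = r[:]
        nxt[0] = 1 - r[1]      # left inverter (assumed wiring)
        for i in range(1, n_stages + 1):
            # Inverted acknowledge: from the next stage, or from the right inverter.
            inv_ack = x if i == n_stages else 1 - r[i + 1]
            nxt[i] = c_element(r[i - 1], inv_ack, r[i])
        x = 1 - r[n_stages]    # right inverter samples the old r[n]
        r = nxt
        trace.append(tuple(r))
    return trace


if __name__ == "__main__":
    for step, state in enumerate(simulate_chain(n_stages=3, n_steps=12)):
        print(step, state)
```

With all nodes initially low, the only enabled transition in this model is the rising of r(0), after which the transition travels down the chain while new transitions are generated at r(0). The model gives no timing or energy figures; those depend on transistor sizing and loading, which is what the circuit simulations measure.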
C-Element Implementations
A conventional pull-up pull-down realization of the function is shown in Figure 4. This circuit has been presented in [1] by Sutherland. This implementation is ratioless, i.e. it does not impose any restrictions on the sizes of the transistors. From the operation of the circuit, we conclude that N1, N2, and N6 are the main pull-down transistors which contribute to output switching; they are of size W. In contrast, N3, N4, and N5 only provide the necessary feed-back to hold the state of the output when the values of the inputs do not match; hence, they are made as small as possible to reduce their loading effect. Similarly, the feed-back transistors P3, P4, and P5 have minimum width, while P1 and P2, the normal pull-up transistors, have widths Wp = 2.5W.
The C-element circuit illustrated in Figure 5 has been presented by Martin [7]. This circuit utilizes an inverter latch to maintain the state of the output when the inputs do not have a similar logic level. The circuit suffers from a race problem at node E. There is an inherent resistance to switching the state of the latch that can be reduced, but cannot be eliminated. For proper operation of the circuit, certain size ratios must be imposed on the transistors. The feed-back inverter should be a weak one to allow changes in the state of the latch. Accordingly, the following must hold:
(5)
Minimum size transistors are chosen for the feed-back inverter to reduce the race problem.
The next C-element implementation by Van Berkel [8] is shown in Figure 6. In this circuit, the output state is maintained through a feed-back conducting path of three transistors in the pull-up tree or the pull-down tree. Similar to Sutherland's circuit, this circuit is also ratioless. An advantage of this implementation is that it is symmetrical with respect to the inputs. For the circuit to have the same pull-up and pull-down resistances (when switching) as the previous two implementations, the normal N-tree and P-tree transistors, except those of the output inverter, must be made half the size. The feed-back transistors N3 and P3 are, as usual, of minimum size, and N6 and P6 have a normal size to achieve the load driving capability of the previous circuits.
Results of the Tests
Figure 7 shows the energy dissipation versus the propagation delay for the three C-element implementations with a fan-out of 3 inverters under the first test. The size of the C-element gate increases along each curve from the right-hand side of the graph toward the top of the graph. Thus, one might get two different energy readings for the same delay and implementation, but they correspond to two different sizes.
Figure 7: HSPICE simulation results, energy versus propagation delay (nSec), for the C-element gates in the first test setup
Point "M" in the figure shows the minimum delay achievable using the second (Martin's) implementation of the C-element. For the same delay, if we chose the first (Sutherland's) implementation we would be saving 42% in energy. Furthermore, if we chose the third (Van Berkel's) implementation, we would be saving 58% in energy over Martin's circuit. Similarly, point "S" shows the minimum obtainable delay using Sutherland's implementation. Van Berkel's circuit obtains the same delay for 45% less energy. Notice also that delays between 1.15 nSec and 1.27 nSec can be achieved by Van Berkel's circuit but not by the other two circuits. Although the minimum delays of the circuits are all within a 10% difference, Van Berkel's circuit achieves a significant energy savings of about 50% for the same delay.
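The two savings quoted at point "M" are both relative to Martin's circuit, so they can be combined to compare Van Berkel's circuit directly with Sutherland's at that same delay. The short calculation below is a derived illustration using only the 42% and 58% figures from the text; the resulting number is not reported in the paper, and the helper name `relative_saving` is hypothetical.

```python
def relative_saving(saving_a: float, saving_b: float) -> float:
    """Energy saving of circuit B relative to circuit A.

    Both arguments are fractional savings measured against the same
    baseline circuit at the same delay.
    """
    energy_a = 1.0 - saving_a   # energy of A as a fraction of the baseline
    energy_b = 1.0 - saving_b   # energy of B as a fraction of the baseline
    return 1.0 - energy_b / energy_a


if __name__ == "__main__":
    # Sutherland saves 42% and Van Berkel 58% over Martin at point "M".
    s = relative_saving(0.42, 0.58)
    print(f"Van Berkel vs. Sutherland at the delay of point M: {100 * s:.1f}% less energy")
```

This figure applies at the delay of point "M" only; the 45% saving quoted at point "S" is measured at a different, shorter delay.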
The results of the simulations for the second test environment are depicted in the form of an energy-frequency (E-F) graph in Figure 8.
Although results may vary for other test environments, we can say that Van Berkel's C-element is definitely the right choice for low-power and high-performance applications. The superiority of Van Berkel's implementation can be explained intuitively as follows. Sutherland's implementation has six transistors dedicated to latching, which do not resist the output switching of the circuit. The second implementation has only two transistors dedicated to latching, but the topology is such that they do resist output switching. Van Berkel's implementation has the advantages of both: minimum overhead for latching with no resistance against output switching. Furthermore, Van Berkel's implementation has a symmetrical topology with respect to the inputs.
The results obtained through this study are not limited to the C-element and may be extended to similar circuits whose output state must be latched. For example, this comparison demonstrated that minimizing the number of transistors dedicated to latching and avoiding topologies that resist output switching may result in significant energy savings.
