Abstract
Introduction
Handshake circuits form the intermediate representation in the fully automatic compilation of Tangram programs to VLSI circuits [14]. VLSI programs written in Tangram are translated in a transparent way into handshake circuits, which are subsequently, on a component-by-component basis, replaced by gate netlists. The datapaths of these circuits have previously been implemented using double-rail data encoding. With this approach a number of interesting circuits have been realized, such as the chips for the error decoder of the Digital Compact Cassette player [15]. Double-rail implementations of handshake circuits are robust, but they have two important disadvantages. First of all, the realizations are rather area inefficient. Since for each bit two wires are used in the encoding, compared to one wire for synchronous circuits, the area of a double-rail circuit is about twice that of an equivalent synchronous realization. (Each wire has to be driven by some cell, and hence implies transistors and circuit area.) Secondly, double-rail combinational logic is generally implemented using dedicated double-rail cells, not normally available in a standard-cell library.
The goal set for the work described in this paper was to reduce the area overhead of handshake circuits and, simultaneously, to map them onto a generic standard-cell library. Both points are essential when striving for acceptance, application, and production of asynchronous circuits on a larger scale. We want to leave handshake circuits as they are, so the compilation from Tangram to handshake circuits should be independent of the style of implementation. We fur[...] control, and to self-initializable realizations. An obvious candidate for area reduction is the double-rail datapath. Single-rail data encoding [11] uses only one wire per bit, plus one additional wire to signal the validity of the data. This form of data encoding is also known as bundled data [13] and was already applied in the sixties in the Macromodule project [12]. Especially for wide datapaths, single-rail encoding should lead to a reduction in the number of wires and transistors, and thus to a smaller area. Of course this comes at some price, such as new test challenges and the need for more verification.
An important advantage of single-rail datapaths is that datapath operators such as adders and exclusive ors can be found in any generic standard-cell library. This motivates our choice to try and implement all handshake components, control and data, using a library of common standard cells ((n)ands, (n)ors, inverters, x(n)or, and some complex gates) only.
The control path is also a potential source of area inefficiencies. One remedy there is to identify frequently occurring combinations of handshake components and substitute these by more economic implementations. This is not further addressed here. Some ideas may be found in [1, 9, 16]. One may observe that low power or high speed are not identified as primary goals in this exercise, although low power is one of the main motivations for looking at asynchronous circuits. The reason for this is that smaller area is the main concern, because the 70-100% area overhead of double-rail circuits is a serious handicap. We furthermore believe that, to some extent, less area implies shorter wires and thus less power and potentially more speed.
An additional challenge in the design of single-rail handshake circuits is that the compiler should produce standard-cell netlists that require minimal effort for post-layout verification. This push-button requirement implies that safety margins in the data bundles are an important aspect of the single-rail design flow. This paper necessarily focuses on some essential insights required to understand and appreciate the operation and implementation of single-rail handshake circuits. A more extensive treatment of the subject can be found in [9].
Handshake channels
In a handshake protocol, an active and a passive partner synchronize via a handshake channel by communicating request and acknowledge signals. The active partner initiates a handshake by issuing a request; the passive partner then responds by sending an acknowledge. On the handshake channel this results in an alternation of request and acknowledge events.
Handshake channels can be used to communicate data. To this end, data can be encoded in the request (push channels), in the acknowledge (pull or combinational channels), or in both. Channels with no data encoded in the request and acknowledge are called synchronization or control channels. Communication on push channels is data driven: the (active) sender pushes the data through the channel. On a pull channel the (active) receiver pulls the data through the channel and communication is demand driven. In the remainder we consider push and pull channels only. One way to encode data in the handshake protocol is to use the so-called double-rail code [11], in which two wires, z0 and z1, are used to represent a bit z. When no value is present, both wires are low. A '0' is encoded by having z0 high and z1 low, and a '1' by having z1 high and z0 low. The state in which both wires are high is not allowed.
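The double-rail code described above can be illustrated with a minimal sketch. The function names (`encode_bit`, `decode_pair`, `word_is_valid`) are hypothetical and chosen for illustration only; they do not come from the Tangram tool set.

```python
from typing import Iterable, Optional, Tuple

EMPTY = (0, 0)  # both wires low: no value present


def encode_bit(bit: Optional[int]) -> Tuple[int, int]:
    """Return the wire pair (z0, z1) for a bit; None encodes 'no value'."""
    if bit is None:
        return EMPTY
    if bit not in (0, 1):
        raise ValueError("bit must be 0, 1, or None")
    # '0' is z0 high and z1 low; '1' is z1 high and z0 low.
    return (1, 0) if bit == 0 else (0, 1)


def decode_pair(z0: int, z1: int) -> Optional[int]:
    """Recover the bit from a wire pair, or None when no value is present."""
    if (z0, z1) == (1, 1):
        raise ValueError("both wires high is not an allowed state")
    if (z0, z1) == EMPTY:
        return None
    return 1 if z1 else 0


def word_is_valid(pairs: Iterable[Tuple[int, int]]) -> bool:
    """A word carries data once every bit has exactly one wire high."""
    return all(z0 != z1 for z0, z1 in pairs)
```

Because validity can be derived from the data wires themselves, a double-rail word needs no separate data-valid wire; the price is two wires per bit.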
Double-rail encoding naturally combines with a four-phase handshake protocol for control. In a double-rail datapath the data can then be interpreted after the up-going phase. In the return-to-zero phase the data is deasserted. Therefore, this phase is functionally redundant. This period can be shortened by a quick reset. In a single-rail data-valid scheme the data may still be valid during the return-to-zero phase of the handshake, which can be exploited to make this phase functional.
Single-rail data-valid schemes
A single-rail handshake protocol is a contract between a sender and a receiver that defines the period during which the data on that channel is valid. The sender of the data issues a data-valid signal when the data on its output is valid, and the receiver responds by sending a data-release signal when the data has been processed (e.g. latched), cf. Fig. 1. The sender has to keep the data on the channel stable and valid at least from the issue of the data-valid signal until the receipt of the data-release signal.
In a single-rail data-valid scheme the relation between sender and receiver, with respect to timing, is essentially symmetrical. A sender may take an indefinite amount of time to prepare valid data. After the issue of the data-valid signal, however, it has to keep the data constant. The receiver can then prolong the data-valid period as long as it wants, and end this period by issuing the data-release signal. This symmetry is broken if dynamic logic is applied to process data (though charge retention transistors can solve this) or if one of the handshake partners is ruled by a clock.
In the Macromodule project, communication of data was implemented using a single-rail handshake protocol, in which the request and acknowledge signals were called initiation and completion signals [12]. The terms single and double rail are due to Seitz [11]. In the remainder of this section two- and four-phase data-valid schemes are introduced. The two-phase push scheme is also known as the bundled-data convention [...] Generally, only the early scheme is applied, except in [2], where broad is also mentioned.
Two-phase single rail
In the two-phase handshake protocol, there is no choice for the data-valid and data-release signals. On a push channel, every request signals the beginning of a data-valid period, and the subsequent acknowledge signals the end of the data-valid period. This implies that the data is valid during the handshake (between request and acknowledge) and that the sender can prepare new data between two handshakes.
In contrast, on a pull channel the data-valid signal is issued by the passive partner, and thus encoded in the acknowledge. The data-release signal is then encoded in the request. The sender of the data now prepares new data during the handshake, and keeps the data stable between two consecutive handshakes. The timing diagrams for push and pull two-phase single-rail channels are depicted in 
Four-phase single rail
In a four-phase handshake protocol half of the request and the acknowledge events are redundant. In the data-valid scheme we can choose which events are redundant, and which really designate the data-valid period.
On a push channel, for example, the data-valid signal is encoded in the request, and the data-release signal in the acknowledge. The data-valid signal of course precedes the data-release signal. This leaves us with three possible choices, which are depicted as a timing diagram in Fig. 3. On a pull handshake channel we can also choose three data-valid schemes. The timing diagrams for these are shown in Fig. 4. One may observe that in the early scheme the data is valid during the four-phase handshake. In the late and broad scheme the data is valid between two handshakes, and hence must be kept stable by the sender.
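The count of three choices on a push channel follows from a small enumeration: the data-valid event must be one of the two request events, the data-release event one of the two acknowledge events, and valid must precede release within the cycle. The sketch below is illustrative only; the event names are hypothetical, and the mapping of the labels early, broad, and late onto the three pairs is an assumption based on common usage of these terms, not something stated explicitly in the text.

```python
from itertools import product

# One four-phase handshake cycle, in order of occurrence.
CYCLE = ["req_up", "ack_up", "req_down", "ack_down"]


def push_data_valid_schemes():
    """Enumerate (data_valid, data_release) event pairs for a push channel."""
    req_events = [e for e in CYCLE if e.startswith("req")]
    ack_events = [e for e in CYCLE if e.startswith("ack")]
    return [
        (valid, release)
        for valid, release in product(req_events, ack_events)
        if CYCLE.index(valid) < CYCLE.index(release)  # valid precedes release
    ]


# Assumed labels (not confirmed by the text) for the three surviving pairs.
SCHEME_NAMES = {
    ("req_up", "ack_up"): "early",
    ("req_up", "ack_down"): "broad",
    ("req_down", "ack_down"): "late",
}
```

The fourth combination, data-valid on the falling request and data-release on the rising acknowledge, is rejected because the release would precede the valid event in the same handshake.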
Minimum-power single rail
In the definition of the data-valid schemes we have only defined what rules should be obeyed during the data-valid periods. The sender then has to keep the data stable so that the receiver can safely interpret it. Data-valid periods are separated by what we might call data-change periods, during which a sender can prepare new data. For this period we have not defined any restrictions yet.
An additional restriction to the data-valid scheme could be to limit the number of transitions in the data-path to at most one per bit. This obviously is the minimum number of transitions that should be allowed since the sender must be able to switch between any two messages. We call data-valid schemes with this additional constraint minimum-power schemes, since they minimize the number of transitions in the data channels, and thus minimize the energy consumption (assuming a CMOS implementation, in which energy consumption is dominated by switching energy).
In general we allow for an arbitrary number of transitions in the datapath during data-change periods. This means that spurious transitions due to deep combinational logic are allowed. Minimum-power schemes are more strict, which limits the implementation freedom and results in more circuit area.
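The difference between the general and the minimum-power schemes can be made concrete by counting transitions per bit over a data-change period. The following is a minimal sketch under the assumption that the datapath is observed as a sequence of integer-coded words; the function names are hypothetical.

```python
def min_transitions(old: int, new: int) -> int:
    """Minimum number of wire transitions needed to go from old to new."""
    return bin(old ^ new).count("1")


def per_bit_toggles(trace, width):
    """Count how often each bit toggles over a sequence of observed words."""
    counts = [0] * width
    for a, b in zip(trace, trace[1:]):
        diff = a ^ b
        for i in range(width):
            counts[i] += (diff >> i) & 1
    return counts


def is_minimum_power(trace, width):
    """True if no bit makes more than one transition in the data-change period."""
    return all(c <= 1 for c in per_bit_toggles(trace, width))
```

For example, the trace `[0b0000, 0b0001, 0b0011, 0b0010]` ends in a word that differs from the first in one bit only, yet bit 0 toggles twice on the way; such a spurious transition is allowed in the general schemes but not in a minimum-power scheme.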
Single-rail handshake circuits
Single-rail handshake circuits can be implemented by substituting single-rail implementations on a component-by-component basis. We concentrate on the data-transfer action described by Tangram fragment z := z @ y, where @ is some binary operation on z and y. We furthermore assume that there is also another assignment to z, resulting in a multiplexer on the write port of z. The corresponding handshake circuit is depicted in [...]. A data transfer is initiated by a request signal on handshake channel a, which connects to a transferrer. This then sends a request for data along channel b. The operator forwards this request to the handshake latches, which correspond to the variables z and y. The latches then assert data, which is processed by the operator and passes through the transferrer to channel c. The multiplexer forwards this data to the write port of handshake latch z, in which it is subsequently written. The acknowledge thereof passes through the multiplexer and the transferrer and finally arrives at channel a, which completes the data transfer.
For the realization of these components we have to choose a data-valid scheme for the data channels. Two conventions are discussed in this section, which are both based on a four-phase handshake protocol and on a wire-only implementation of the transferrer. The latter is considered to be important since the transferrer is a frequently used component in Tangram handshake circuits. Other implementations of the transferrer and other interesting data-valid combinations may be found in [9].
Early scheme
A wire-only implementation of the transferrer that resembles the double-rail implementation [14] is depicted in Fig. 6. When activated along channel a, this transferrer first performs an up-handshake along b, then one along c, after which it sends an acknowledge on a. The return-to-zero path follows the same cycle. If the data-valid scheme on b is known, the data-valid scheme for the transferrer can be determined, see Fig. 7. If an early data-valid scheme on b is used, the operational cycle of the transferrer proceeds as follows: after initiation of the data transfer by a_r↑, between b_r↑ and b_a↑, new data is collected from variables z and y, and this data is processed by the operator (@). The delay of this is then matched in the path from b_r↑ until b_a↑. At c_r↑, the data is valid at the transferrer. The multiplexer is then set in the correct state, and the data is latched in variable z. After this an acknowledge is sent by pulling c_a, and hence a_a, high. This completes the functional part of the data transfer. The return-to-zero phase now follows the same path but does not affect the value of z, and thus is redundant.
This data-valid convention, where the data has to be latched when c_r ∧ ¬c_a, is also described by Seitz [11, Sec. 7.8.2]. It has, however, several apparent drawbacks. First of all, the return-to-zero phase of the data transfer is redundant, which implies that it is either wasted time, or a quick reset should be implemented. One could, for example, apply asymmetric delays in the pull part of the handshake circuit.
Other drawbacks, however, are harder to overcome, namely the complex control circuits for the latch and multiplexer. The latches in variable z have to be normally opaque, which implies that in the up-going phase of the handshake on c, the latch has to be opened and closed, and the return-to-zero phase should not affect the state of the latches. This requires a slow latch-control circuit, based on an S-element [14]. Furthermore, the multiplexer control should accommodate delay matching for the time required to set the switches in the correct position and subsequently pass the data.
Reduced broad scheme
The broad and late scheme for pull handshake channels both assume data to be valid between handshakes, that is, until the initiation of the next handshake on that channel. For the variable this implies that data has to be latched in all readports. Although this leads to interesting circuits, they are not very area efficient.
A property of Tangram handshake circuits that has not yet been exploited is that read and write actions on variables are guaranteed to be mutually exclusive. For the variables in Fig. 6 this implies that a new write action on z or y will not be initiated until the write action on z, and thus all read actions on z and y, have completed. For the data-valid period on the read ports of a variable this leads to an interesting new data-valid scheme, namely one in which the data is still valid after the read acknowledge, until the next write operation on that variable. Therefore, one may not assume the data to be valid until the next read request, as would be required for the broad scheme. This new data-valid scheme is called reduced broad.
In the Tangram compiler, this reduced broad scheme is assumed in data-transfer actions. A nice feature of this scheme is that it allows all phases of the handshake protocol to be functional and to contribute to the data-valid safety margin. The transferrer that is used in this convention is the one depicted in Fig. 6. With reference to [...]
II. Multiplexers are set and the delay through the multiplexer is matched. The latches in the variables are opened (switched from opaque to transparent).
III. The time required by the control circuit between a_a↑ and a_r↓ (at least one CMOS inversion) adds an additional safety margin to the data bundle.
IV. The same path as in phase I is followed, which means that the delay in the datapath is again matched. After this the pull handshake components are back in their initial state, but the data is still valid.
V. Delay through the multiplexer is matched again in the control circuit of the multiplexer. Data is latched in [...]
This scheme combines efficient multiplexer and latch control circuits and allows symmetric delay matching. This means that the return-to-zero phase in the delay-matching path is functional, which can be exploited by either halving the delays to obtain the same margin, or by sticking to the same delay matching and thus doubling the safety margin, or something in between.
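The trade-off between halving the delays and doubling the margin can be expressed as simple arithmetic. The sketch below is a deliberately simplified model: it assumes the matching delay is traversed a whole number of times (`passes`) before the data is latched and ignores any additional control delay; the function name and the example delays are hypothetical.

```python
def bundle_margin(datapath_delay_ns: float,
                  matching_delay_ns: float,
                  passes: int = 2) -> float:
    """Ratio of total matched delay to worst-case datapath delay.

    passes=2 models a scheme in which the matching delay is in the
    path during both the up-going and the return-to-zero phase.
    """
    return passes * matching_delay_ns / datapath_delay_ns
```

With a 4 ns datapath, a 4 ns matching delay traversed twice gives a factor of 2, whereas a 2 ns matching delay traversed twice gives a factor of 1, the same as a 4 ns delay traversed once.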
Single-rail handshake components
Some of the essential datapath handshake components are reviewed in this section. The implementation of the single-rail components implies some obligations for pre- and post-layout netlist verification. Important characteristics are the safety margins in the delay matching.
Handshake latches
The compilation scheme from Tangram to handshake circuits guarantees that read and write actions on variables are mutually exclusive. This implies that whenever a write request arrives, all read actions have completed. In that sense, a write request may be interpreted as a data-release signal for all read ports. This observation forms the basis for the reduced-broad data-valid convention.
The variable is implemented such that during a write handshake the values at all read ports are modified. The associated data-valid scheme is depicted in Fig. 9. The up-going write request is interpreted as the combined data-release signal for all read ports, and the down-going write acknowledge as its combined data-valid. Therefore all read ports can be implemented by wires only: the read acknowledge is connected to the read request.
For the implementation of the variable, a latch has to be chosen, and based on that choice, a latch write control circuit. We have chosen the circuits depicted in Fig. 10. This latch requires complementary control signals, denoted by en and ne. These signals are generated by a simple latch control circuit. One may observe that no Muller C-element is used to generate a completion signal. The correct operation of the latch control circuit depends on assumptions about (i) the minimum time between w_r↑ and w_r↓, and (ii) the capacitive load at the latch outputs. To illustrate what may go wrong when the latch opens for only a short period, consider the events depicted in Fig. 11. The circuit simulated here consists of a 4-bit variable and a realistic, fast environment. As a result, the pulses generated on the enable and its complement are short. Together they allow for a 1 nanosecond period of transparency of the latch. It now depends on the output load of the particular latches whether this time suffices to make a transition.
Initially, the outputs of all latches are high and the inputs of all latches are low. The four latches are loaded with 0.1, 0.5, 0.8, and 1.1 pF. The latch with the 0.1 pF load switches fast: its transition is completed before the latch closes again. The second latch has a 0.5 pF output load. The open period of the latch still suffices to make a nearly complete swing. While the latch closes it completes its transition.
The third latch is critical. Due to the 0.8 pF load it requires more time to make a complete transition. When the latch closes again, its output is still changing. The output now takes considerably more time to complete this transition. The fourth latch, with the 1.1 pF load, does not make it. When the latch closes again its output has not yet passed the threshold. Therefore, when the feedback path is created, the output overwrites the state of the latch. One may observe that this problem is closely related to synchronization and metastability problems.
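The dependence on output load can be illustrated with a first-order RC estimate. This is only a rough sketch and not the circuit simulation behind Fig. 11: the 1500 Ω drive resistance is an invented value chosen purely for illustration, the threshold is assumed to lie at half the supply voltage, and the function names are hypothetical.

```python
import math

ASSUMED_DRIVE_OHM = 1500.0  # hypothetical value, not taken from the paper


def time_to_threshold_ns(r_ohm: float, c_pf: float,
                         threshold_fraction: float = 0.5) -> float:
    """Time for an RC discharge from Vdd to threshold_fraction * Vdd."""
    rc_seconds = r_ohm * c_pf * 1e-12
    return rc_seconds * math.log(1.0 / threshold_fraction) * 1e9


def crosses_threshold(r_ohm: float, c_pf: float, window_ns: float) -> bool:
    """True if the output passes the threshold while the latch is transparent."""
    return time_to_threshold_ns(r_ohm, c_pf) < window_ns
```

Under these assumptions and a 1 ns transparency window, the outputs loaded with 0.1, 0.5, and 0.8 pF pass the threshold in time, while the 1.1 pF output does not, which matches the qualitative behaviour described above without reproducing the actual waveforms.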
From an engineering point of view, if one wants to build a working system, safety is an important issue. It is therefore useful to establish safety guidelines for the implementation of variables. At the outputs of latches that have a high fan-out, and thus face a high capacitive load, additional buffers (drivers) have to be inserted to assure timely transitions.
Figure 10: Multiplexer-like latch circuit (top) and associated latch control (bottom). The feed-back stack of the latch only compensates leakage and can therefore be minimum sized. As a result, the input load on the en-line will be less than the input load on the ne-line. This is exploited in the control circuit.
Multiplexers
In a multiplexer the data of two handshake channels (say a and b) has to be merged onto a third channel (say c). This requires a multiplexer cell per bit and an associated multiplexer control circuit. For the multiplexer cell a transmission gate configuration could be used, but we have chosen to implement this with an and-or function, since this generalizes readily to more inputs.
For the control of the cells we use two (complementary) select signals, say a_s and b_s. For bit i we can then use a cell with functionality c_i = a_i a_s + b_i b_s. The control circuit of the multiplexer now has to generate the select signals and the request/acknowledge signaling required for the handshake protocol. A possible realization of the control circuit is depicted in Fig. 12. This circuit leaves the select signals in the last selected state, which saves power in the datapath be[...] This would provide a perfect match with the datapath. However, since the acknowledge signal also introduces some delay, we do not necessarily have to implement a full match in the request path. In the circuit in Fig. 12, the request is forwarded early, and the sum of the delay of the or-gate, the inverter, and the nor-gate should match that of the data. It is furthermore assumed that the select signals switch sufficiently fast, that is, before c_r↑ arrives.
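The and-or function of the multiplexer cell, c_i = a_i a_s + b_i b_s, can be sketched per bit as follows. The function names are hypothetical, and the check on the select signals only models the assumption that they are complementary; it is not part of the cell itself.

```python
def mux_cell(a_i: int, a_s: int, b_i: int, b_s: int) -> int:
    """And-or multiplexer cell for one bit: c_i = a_i*a_s + b_i*b_s."""
    return (a_i & a_s) | (b_i & b_s)


def mux_word(a, b, a_s: int, b_s: int):
    """Apply the cell to every bit of two equally wide input words."""
    if a_s == b_s:
        raise ValueError("select signals are assumed to be complementary")
    return [mux_cell(a_i, a_s, b_i, b_s) for a_i, b_i in zip(a, b)]
```

Extending the cell to more inputs only adds one and-term per input, which is the generalization that motivates the and-or choice over a transmission-gate configuration.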
Operators
Tangram offers parameterized adders, subtractors, and comparators for different combinations of unsigned and two's complement numbers. Furthermore, simple boolean functions are supported (and, or, x(n)or, and inverse).
The general structure of the implementation of binary operators is depicted in Fig. 13. The evaluation scheme is demand driven. The incoming request is forked to the two operands and the subsequent acknowledges (data-valid) are combined with a C-element. The C-element is generally functionally redundant and is removed in a post-optimization procedure (see below). Its delay should therefore not be counted on in the delay matching.
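For readers unfamiliar with the C-element that combines the two acknowledges, a minimal behavioural model is given below. The class name and interface are hypothetical; the behaviour is that of a standard Muller C-element, whose output changes only when both inputs agree.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial: int = 0):
        self.out = initial

    def update(self, a: int, b: int) -> int:
        # The output follows the inputs when they agree and holds otherwise.
        if a == b:
            self.out = a
        return self.out
```

When both inputs are driven by the same signal, the element degenerates to a wire, which is why such instances can be removed without changing the function of the control circuit.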
In all operators, delay matching is used to generate completion signals. The matching is done on a worst-case basis, which means that for the adder, for example, the full-length carry ripple is anticipated, rather than the lower average case. A double-rail carry scheme could be employed to obtain optimal delay matching, but that requires more circuitry [6].
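The gap between worst-case and actual carry ripple can be illustrated by measuring the longest carry chain for given operands. This sketch assumes a plain ripple-carry adder with no carry-in; the function name and the definition of chain length (number of consecutive bit positions a single carry travels through, including the position that generates it) are choices made here for illustration.

```python
def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Length of the longest carry ripple when adding a and b."""
    longest = run = 0
    carry = 0
    for i in range(width):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        if a_i & b_i:            # generate: a new carry starts here
            run, carry = 1, 1
        elif (a_i ^ b_i) and carry:  # propagate an incoming carry
            run += 1
        else:                    # kill, or nothing to propagate
            run, carry = 0, 0
        longest = max(longest, run)
    return longest
```

Adding 0b1111 and 0b0001 exercises the full four-bit ripple, while adding 0b0101 to itself produces chains of length one only; worst-case matching reserves time for the former on every operation.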
The matching is implemented by using strings of inverters or nand and nor gates such that delays match at about equal output load, using the linear delay versus load functions from the characterized cell library. For this to work it is again important to have a good driver strategy, such that transition times (and delays) in the datapath are short and predictable before layout. In the reduced broad scheme, the delays occur in the critical path twice, which gives a factor two safety margin, or can be used to halve or at least reduce the delays.
Single-rail design flow
One of the innovative aspects of our approach to single-rail asynchronous circuits is that they are compiled from a high-level VLSI programming language. This requires us to pay attention to two essential goals, namely, the resulting circuits should be area/power/timing efficient and they should be push-button correct. Compilation from Tangram to handshake circuits is independent of the style of implementation of these handshake circuits. It was not obvious beforehand that this independence could be maintained, but it turned out that handshake circuits were indeed a viable starting point for single-rail compilation. The compilation from Tangram to handshake circuits is a transparent syntax-directed translation [14].
Peephole optimization
Post-optimization can be considered at all levels of representation. At the handshake circuit level an interesting saving is obtained by replacing trees of sequencers, mixers, and multiplexers by equivalent multi-channel components. This is added as a peephole optimization step to the compiler from Tangram to handshake circuits.
Compilation from handshake circuits to single-rail standard-cell netlists is implemented as a component-by-component substitution process. Each component obeys a four-phase handshake protocol on each channel, which makes the netlist generators well-manageable. It does result, however, in a netlist that contains redundancy and possibly inefficiencies. In a post-optimization step these are eliminated.
The role of some of these optimizations can be illustrated on the compilation of Tangram expression (z.0 + z.1) + (a - 1), in which a is an integer, and z is a tuple of two integers. The compilation of this expression results in the handshake circuit shown in Fig. 15. The control circuit that is generated by the netlist generators is also depicted.
This circuit contains redundant C-elements (C-elements with identical inputs) and parallel delays. In the post-optimization process these are eliminated and only delays remain. The figure shows a simplified view of the delays. Each component generates a string of unit-delays, long enough to provide a worst-case match with its datapath.
In general, nearly all C-elements that are generated from binary operators and from components used to tuple and de-tuple datapaths are eliminated from the netlist. Note that this is possible because in the control circuit for the readport of a variable (or a constant) the data-valid wire is directly connected to the data-release wire. Furthermore, depending on the structure of the expressions in the Tangram program, a lot of delays are eliminated because they are in parallel with other (equivalent) delays.
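The two control-path simplifications described above, removal of C-elements with identical inputs and merging of parallel delays, can be sketched as rewrite rules on a small expression tree. This is an illustration only and does not reflect the rule format of the actual optimizer; the tuple encoding, the function name, and the unit-delay counts in the example are all hypothetical.

```python
def simplify(node):
    """Rewrite a control expression built from wires, delays, and C-elements.

    A wire is a string, a delay is ("delay", units, source), and a
    C-element is ("c", left, right).
    """
    if isinstance(node, str):
        return node
    kind = node[0]
    if kind == "delay":
        _, units, source = node
        source = simplify(source)
        if isinstance(source, tuple) and source[0] == "delay":
            # Two delays in series act as one longer delay.
            return ("delay", units + source[1], source[2])
        return ("delay", units, source)
    if kind == "c":
        left, right = simplify(node[1]), simplify(node[2])
        if left == right:
            # A C-element with identical inputs behaves as a wire.
            return left
        if (isinstance(left, tuple) and isinstance(right, tuple)
                and left[0] == right[0] == "delay" and left[2] == right[2]):
            # Parallel delays from one source: the slower one determines
            # when the C-element output changes.
            return ("delay", max(left[1], right[1]), left[2])
        return ("c", left, right)
    raise ValueError(f"unknown node kind: {kind!r}")
```

For a control circuit with two operators feeding a third, for instance `("delay", 3, ("c", ("delay", 3, ("c", "req", "req")), ("delay", 2, ("c", "req", "req"))))`, the rules reduce the whole structure to a single chain `("delay", 6, "req")`, so only delays remain.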
A second source for post-optimization is the datapath itself. Tangram offers a limited set of data operators, from which others can be constructed. A decrementer, for instance, is not a basic operator but can be constructed from a subtractor and a constant, see Fig. 15. As a result of this the generated netlist is not optimal, and in the post-optimization step the subtractor and the constant (which keeps one of the inputs of the subtractor constant (high)) are degenerated to a decrementer. Other types of optimizations that are applied are input extension (e.g., combining two or-gates into a three-input or-gate) and bubble shuffling (e.g., to replace and- by nand-gates in the circuit generated for boolean expression a·b + c·d). An essential property of the peephole optimizations is that they do not undermine the safety margins in the bundles. Improvements in the datapath generally reduce the latency of the rippling data, and thereby increase the margins. The optimization step has been implemented using a rule-based pattern-matching algorithm (VERA, [7]). Logic optimizers could also play a role in the datapath optimization.
Figure 15: Post-optimization example showing handshake circuit (top), netlist for control circuit generated by direct substitution (middle), and post-optimized control circuit (bottom). The dashed boxes indicate the relation with the handshake circuit.
Safety and margins
Building a single-rail compiler in a standard-cell layout style poses some challenges due to the uncertainty before layout of the actual wiring capacitances. Fortunately the effect of placement and routing is subject to limited predictability, which forms the basis of the safety margins and its post-layout checking.
One aspect of the safety procedure is that completion detection is applied to high-fanout req/ack wires, such as enable lines in variables and select lines in multiplexers. Another important point is the implementation of delays. In delay matching, the bundling constraint dictates that the delay should match against the worst-case delay of the datapath. This worst-case delay cannot always be determined on a component basis, because of the possible excessive fanout in the datapath due to sharing. To make the matching work, timely transitions in the datapath should be assured by a dedicated driver strategy (see below).
The choice of the safety margins in the delay matching should be based on a trade-off between performance and verification effort. The unpredictable factor again is the wiring capacitance, which may be minimal for the delay matching path. We have chosen to implement delays such that they match at about equal loads, which gives a factor two safety margin because the delays are in the matching path twice (in the (reduced) broad scheme). For high performance applications this margin might be reduced to 30% or less. Our primary concern, however, has been to keep the verification obligations manageable.
A driver strategy is a procedure to resize transistors or to add drivers, such that timely transitions can be assured. This is required to guarantee proper delay matching and to minimize short-circuit dissipation. In the design flow the driver strategy is applied before cell placement, but after peephole optimization. Given the linear delay versus load characterization of the cell library, the procedure is based on adding for each node in the circuit all fan-in loads and, based on the fanout degree, an estimated wire load. It then depends on the cell that is driving that load whether additional drive should be added. The exact implementation of the driver strategy depends on the available repertoire of standard cells, which generally consists of standard gates and a range of inverters and buffers of different driving strengths. For large netlists a post-placement adjustment is also required, which can then be based on more dedicated wire-load estimations.
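The load estimate at the heart of such a driver strategy can be sketched as follows. This is a simplified illustration, not the procedure implemented in the design flow: the cell parameters, the wire-load constant per fanout, the delay budget, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    intrinsic_ns: float      # delay at zero load
    slope_ns_per_pf: float   # linear delay-versus-load coefficient


def node_load_pf(input_loads_pf, wire_pf_per_fanout: float = 0.02) -> float:
    """Sum of driven input loads plus a fanout-based wire-load estimate."""
    return sum(input_loads_pf) + wire_pf_per_fanout * len(input_loads_pf)


def delay_ns(cell: Cell, load_pf: float) -> float:
    """Linear delay model from the characterized cell library."""
    return cell.intrinsic_ns + cell.slope_ns_per_pf * load_pf


def needs_extra_drive(cell: Cell, input_loads_pf, max_delay_ns: float) -> bool:
    """Decide whether a buffer or a stronger cell should be inserted."""
    return delay_ns(cell, node_load_pf(input_loads_pf)) > max_delay_ns
```

A weak inverter driving four 0.05 pF inputs would exceed a 0.5 ns budget in this model and be flagged, while the same inverter driving a single input would not.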
After layout the wiring capacitances are known and it can be verified how realistic all timing assumptions have been, especially how safe the data bundling is. The actual delay of the delay-matching chains can be determined straightforwardly. Determining the critical path of the combinational part is more complex, but can be done by a timing verifier [3].
We have not implemented the critical-path checks in the design flow but rather applied timing simulations based on accurate timing models of the standard cells. In these simulations the set-up and hold time constraints of the latch can be verified in various scenarios. When there appears to be a safe margin, the silicon is likely to operate correctly under a reasonable range of operating conditions.
The verification process has been applied to multiple designs and uncovered no problems so far. The design flow has also resulted in first-time-right silicon, which is functional over a wide supply-voltage range [16]. This suggests that there is room for tighter delay matching.
Results
All handshake components have been mapped onto a standard generic cell library without any problems. To evaluate the efficiency of the single-rail circuits, some experiments were done and compared with double-rail implementations and with results reported by others.
An interesting example is a FIFO, because in that circuit no multiplexers or operators are used, so that the variable is operated at full speed. Simulations of a compiled 32-bit FIFO were done, and it turned out that the circuit operates correctly at a speed of 115 MHz, using the latches and control of Fig. 10 (0.8 μm CMOS technology, at 5 Volts and typical processing). Although this circuit uses normally-opaque control, the cycle time compares well with that reported for dedicated micropipeline latch-control structures [4].
A single-rail implementation of the DCC error decoder [15] has led to functionally and audibly correct silicon.
The single-rail chip is an improvement over the double-rail version in all performance aspects. It is 40% smaller and uses only 25% of the power. Furthermore, it is potentially 3 times as fast, which is not required in the DCC player, but allows it to be operated at a lower supply voltage. A detailed evaluation of the single-rail silicon, and a comparison with various other realizations of the same function can be found in [16].
Comparison with double-rail
Area Several other designs have also been compiled to netlists and were simulated. A list of these, and a comparison with double-rail implementations of the same circuits, can be found in Table 1. As is to be expected, the gain in circuit area and transistor count varies with the percentage of the circuit required for control. Data-dominated applications, such as the FIFO and the Speech codec, profit highly from the single-rail implementation of the data. A significant part of the gain in the (de)coders is due to the inefficient double-rail implementations of adders and exclusive ors. These implementations can, however, be improved [17].
The DCC controller is control dominated, so single-rail and double-rail make less of a difference.
Power Energy efficiency was a second-order consideration in going from double-rail to single-rail. At first sight one could argue that on a double-rail N-bit handshake channel each communication results in exactly 2N + 2 transitions, whereas the equivalent single-rail channel would require only N/2 + 4 transitions on average (data dependent), which suggests an improvement of a factor 4 (actually 3, for practical N).
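The first-sight argument above can be checked with a few lines of arithmetic. The sketch below assumes our reading of the two counts: in double-rail, one of the two wires of each bit rises and returns to zero (2N) and the acknowledge rises and falls (2); in single-rail, half of the N data wires change on average for random data (N/2) and request and acknowledge together contribute 4. The function names are illustrative only.

```python
def double_rail_transitions(n_bits: int) -> int:
    """Transitions per communication on a double-rail N-bit channel: 2N + 2."""
    # Assumed breakdown: one wire per bit goes up and returns to zero (2 per bit),
    # plus an up and a down transition on the acknowledge wire.
    return 2 * n_bits + 2


def single_rail_transitions(n_bits: int) -> float:
    """Average transitions per communication on a single-rail channel: N/2 + 4."""
    # Assumed breakdown: on average half of the data wires change (random data),
    # plus four transitions on the request and acknowledge wires.
    return n_bits / 2 + 4


def improvement_factor(n_bits: int) -> float:
    """Ratio of double-rail to single-rail transition counts."""
    return double_rail_transitions(n_bits) / single_rail_transitions(n_bits)


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        print(n, round(improvement_factor(n), 2))
```

The ratio only approaches 4 for very wide channels; for widths of 16 to 32 bits it stays close to 3, in line with the remark on practical N.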
However, some counter effects exist. Most notably, we did not restrict the transitions on the data wires in the period in which the data is not valid (data-change period). In the reduced-broad scheme for data transfers, for example, the data on the readports of a variable is allowed to change between handshakes. This resulted in an implementation of the variable where a change of state is immediately visible at its output, and thus at all its readports. This may have a dramatic impact on the energy consumption, because the readports may fan out to many logic gates. Apart from the sum of input capacitances of these gates that has to be switched, the outputs of these gates may also be affected. Especially in deep combinational logic (for example in multipliers), this can lead to a lot of redundant transitions.
A consequence of the single-rail implementation of handshake circuits is thus a certain loss in the transparency of the compilation from Tangram to VLSI with respect to power. In double-rail, writing a variable does not lead to transitions on the readports, which means that values have to be explicitly collected during a read action. In single-rail all readports are evaluated in each write action. It now depends on the structure of the handshake circuit (and thus, on the Tangram program) whether these transitions are redundant or not, and thus, whether power is effectively lost. In a FIFO, for example, all transitions are functional, but a parallel multiplier might give rise to quite some spurious transitions.
Speed Single-rail datapaths are expected to be faster than their double-rail equivalents. The main reason for this is the completion detection required when writing double-rail variables. This is typically implemented using a tree of C-elements which makes writing of many-bit variables a slow operation.
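To give a feeling for how the completion-detection cost grows with the width of a variable, the following minimal sketch counts the levels in a tree of C-elements that combines one validity signal per bit. The fan-in of the C-elements (default 2) and the function name are assumptions made for illustration; they are not taken from the actual cell library.

```python
import math


def completion_tree_depth(n_bits: int, fan_in: int = 2) -> int:
    """Levels of C-elements needed to merge n_bits per-bit validity signals."""
    if n_bits < 1 or fan_in < 2:
        raise ValueError("need n_bits >= 1 and fan_in >= 2")
    depth = 0
    signals = n_bits
    # Each level groups up to fan_in signals into one C-element output.
    while signals > 1:
        signals = math.ceil(signals / fan_in)
        depth += 1
    return depth
```

With 2-input C-elements, a 32-bit variable needs 5 levels, and every write has to pass through all of them twice (once for data valid, once for return-to-zero).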
Another reason for the expected gain in speed is that a double-rail datapath requires complex gates with stacks of nMOS and pMOS transistors that are typically higher than required for the equivalent single-rail function. The fanout in a double-rail datapath is, on average, also higher. The safety margin that has to be incorporated in matching delays partly cancels against the return-to-zero delay in the double-rail datapath.
Testing In a double-rail datapath, a stuck-at output fault on a wire that represents a bit leads to deadlock when that bit has to be communicated, either because it does not make an up-going transition, or because it does not return to zero. This property makes stuck-at output faults in the datapath observable.
In single-rail datapaths, stuck-at faults on data wires will not always give rise to deadlock, making testing more complex. However, observability can be enhanced with scan-testing techniques [10].
Comparison with two-phase bundled data
Our single-rail handshake circuits are based on a four-phase bundled-data protocol. Many implementations of bundled-data circuits, however, use a two-phase signaling protocol [5, 8, 13]. The use of such a two-phase protocol at first sight leads to faster and more power-efficient circuits, since only half of the control transitions are required. However, it also leads to some complications, most notably in latch and multiplexer control circuits. Two-phase latch control requires Toggles and Merges (xors) to convert from the two-phase protocol to the level-sensitive enable signals. The Toggle is not a common CMOS gate, and its implementation is complex [8]. An important consequence of the use of Toggles and Merges is that they are in the critical path of the control, which leads to poor cycle times, and hence restricts the maximum attainable speed in a micropipeline. The use of capture-pass latches [13] would allow for a simpler latch-control circuit, but then the latches are slow and require more transistors. Recent insights seem to lead to the conclusion that a four-phase latch-control circuit is more area-efficient and leads to significantly faster circuits [4]. Two-phase multiplexer control could be based on a two-phase call component, the implementation of which is quite complex and requires conversion from two-phase to four-phase signals for the control of the level-sensitive multiplexer switches.
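The conversion problem can be illustrated with a small behavioural model. This is a simplified sketch, not the latch-control circuits of [8] or [13]: the Toggle steers successive input events alternately to its two outputs, the Merge is modeled as an xor, and the hypothetical helper enable_trace derives a level-sensitive enable as the xor of two two-phase signals that we call req and done for illustration.

```python
class Toggle:
    """Steers successive input events alternately to output 0 and output 1."""

    def __init__(self) -> None:
        self.outputs = [0, 0]  # two-phase output wires, both initially low
        self._next = 0         # index of the output that gets the next event

    def event(self) -> int:
        """Process one input transition; return the index of the output that toggled."""
        fired = self._next
        self.outputs[fired] ^= 1
        self._next ^= 1
        return fired


def merge(a: int, b: int) -> int:
    """Two-phase Merge as an xor: a transition on either input toggles the output."""
    return a ^ b


def enable_trace(events):
    """Level of a (hypothetical) latch enable after each two-phase event.

    Assumed scheme: enable = req xor done, so a 'req' event raises the
    enable and the following 'done' event lowers it again.
    """
    req = done = 0
    trace = []
    for name in events:
        if name == "req":
            req ^= 1
        elif name == "done":
            done ^= 1
        else:
            raise ValueError(f"unknown event: {name}")
        trace.append(merge(req, done))
    return trace
```

In this model every handshake still produces a rising and a falling edge on the enable, and both edges pass through the xor, which is one way to see why these elements end up in the critical path of the control.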
In conclusion, two-phase handshaking does not seem to be a viable alternative for handshake circuits.
Conclusion
The goal set for the work described in this paper was to improve the area-efficiency of handshake circuit implementations, without affecting the translation of Tangram to handshake circuits, and with a generic standard-cell library as target technology.
With the choice of single-rail handshake circuits, a generic cell library seems to offer all cells that are required for efficient implementations. In the control and in part of the request/acknowledge signaling, C-elements are required. Many sorts of C-elements can, however, readily be implemented using and-or-invert types of complex gates. The implementation of the single-rail datapath is straightforward, since the cells required are all available in the library.
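As an illustration of the remark on C-elements, a two-input C-element can be described by the majority function of its two inputs and its own output, c' = ab + c(a + b), which maps onto an and-or-invert gate followed by an inverter. The sketch below models this behaviour; the cell name aoi222 is hypothetical and does not refer to a particular library.

```python
def aoi222(a: int, b: int, c: int, d: int, e: int, f: int) -> int:
    """And-or-invert gate (hypothetical cell): NOT(a.b + c.d + e.f)."""
    return 1 - ((a & b) | (c & d) | (e & f))


def c_element_next(a: int, b: int, c: int) -> int:
    """Next output of a 2-input C-element with current output c.

    The output follows the inputs when they agree and holds otherwise,
    realized here as an inverter behind an and-or-invert gate with feedback.
    """
    return 1 - aoi222(a, b, a, c, b, c)


def simulate(inputs, c: int = 0):
    """Apply a sequence of (a, b) input pairs and return the output after each."""
    trace = []
    for a, b in inputs:
        c = c_element_next(a, b, c)
        trace.append(c)
    return trace
```

For example, simulate([(1, 0), (1, 1), (0, 1), (0, 0)]) yields [0, 1, 1, 0]: the output rises only when both inputs are high and falls only when both are low.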
The combination of a four-phase handshake protocol and single-rail (data bundling) leads to surprisingly many choices for data-valid schemes, with trade-offs that have a big impact on the area, speed, and power-efficiency of the resulting circuits. An important consequence of the data-valid scheme chosen in this paper (reduced broad) is that all four phases of a four-phase handshake protocol can be productive without compromising performance.
Future area reductions of handshake circuits may be expected from paying more attention to control and relaxing the bundle constraints in the datapath, possibly at the price of more verification effort.
