We propose and implement a four-phase handshake protocol for bundled-data asynchronous circuits with consideration given to power consumption and area. A key aspect is that our protocol uses three phases for generating the matched delay to signal the completion of the data-path stage operation, whereas conventional methods use only one phase. A comparison with other protocols in a 0.18 µm process showed that our protocol realized lower power consumption than any other protocol at cycle times of 1.2 ns or more. The area of the delay generator required for a given data-path delay was less than half that of other protocols. The overhead of the timing generator was the same as or less than that of other protocols.
Introduction
Deep sub-micron technology enables a large system to be implemented on a single chip. In these systems, serious skew problems with synchronous designs arise in that a global clock must be distributed over the entire chip because of decreased gate delay and wire-dominated delay [1] . Power consumption is also a big concern. The use of asynchronous design could potentially solve these problems.
The bundled-data style of asynchronous design utilizes single-rail encoded data paths (the same as in synchronous designs), as shown in Fig. 1, and each stage of the data path is controlled by a timing signal based on a request-acknowledge protocol instead of a clock. This alleviates the skew problem by localizing the clock tree. In addition, operating speeds are averaged over computations because the period of timing signals can be varied according to computation classes (e.g., arithmetic, logic operations, etc.).
In the bundled-data style, handshake protocols (and the timing generators implementing them) must operate at high speed with low overhead and should consume little power, because the timing signal is a substitute for the global clock. Although many handshake protocols have been proposed [3]-[7], some impose large overheads in initializing the controller state [3], [4], and others have the following problems. MOUSETRAP [7] uses a small controller, which consists of only one latch and one XNOR gate per data-path stage. This feature enables us to efficiently create an acyclic data path, including forks and joins, but it is unsuitable for building cyclic structures, and it is difficult to dynamically vary the delay of the delay generator. GasP [6] has low reverse latency and therefore operates at high speed, but it is not suitable for processing pipelines because it was designed for minimal FIFO control. In [2], a timing generator based on GasP was designed for controlling data paths with processing logic, but the delay generator used in that design is subject to constraints that guarantee correct handshaking. These constraints tend to increase the area and power consumption of the delay generator. In this paper we propose a novel four-phase handshake protocol and a controller design that is suitable for pipelines with processing logic. It not only operates at high speed with fewer constraints imposed on the delay generator, but also reduces the area and power consumption of the delay generator.
The remainder of this paper is organized as follows. Section 2 describes protocol requirements for handshaking and proposes a novel protocol. Section 3 presents an implementation of the protocol which consists of controllers called HSCs and delay generators. The proposed method is evaluated and compared with previous works in Sect. 4. Finally, Sect. 5 gives conclusions.
Proposed Handshake Protocol
The timing generator consists of controllers, delay generators, and wires connecting them, as shown in Fig. 2. Controllers are assigned to data-path stages and send timing signals to the corresponding data-path registers (latches or flip-flops). Delay generators are inserted to guarantee the completion of the corresponding data-path stage, and the connecting wires carry request-acknowledgment signals.
In general, a controller implementing a handshake protocol acts as follows: 1) It waits for a write request (arrival of new data) from the preceding controller and a read request (completion of the succeeding stage) from the succeeding controller. 2) It sends a timing signal to the registers when it has received both request signals. 3) It sends acknowledgment signals to the preceding and succeeding controllers (these serve as request signals for the preceding and succeeding controllers).
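The three steps above can be sketched as a toy behavioral model. This is only an illustration of the generic request-acknowledge sequence; the class name, method name, and log strings are hypothetical and do not correspond to any circuit in this paper.

```python
from dataclasses import dataclass, field


@dataclass
class GenericController:
    """Toy model of steps 1)-3); not a specific published controller."""
    name: str
    log: list = field(default_factory=list)

    def try_fire(self, write_request: bool, read_request: bool) -> bool:
        # Step 1: keep waiting unless both requests are present.
        if not (write_request and read_request):
            return False
        # Step 2: send the timing signal to the registers of this stage.
        self.log.append(f"{self.name}: timing signal")
        # Step 3: acknowledge both neighbours; each acknowledgment acts
        # as a request for the controller that receives it.
        self.log.append(f"{self.name}: ack to preceding controller")
        self.log.append(f"{self.name}: ack to succeeding controller")
        return True
```

Calling `try_fire(True, False)` leaves the log empty, whereas `try_fire(True, True)` records the timing signal followed by the two acknowledgments.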
Ideally, the protocol for the bundled-data style needs to satisfy these requirements with low overhead and also to enable us to design the controller easily and compactly.
Two kinds of protocols are known so far, four-phase and two-phase handshake protocols. Most protocols are categorized into the four-phase handshake protocols. The design of the four-phase controller is comparatively easy, because the operation is based on the level of the signal. However, it requires the return-to-zero phase which incurs severe overhead. A remedy is to design the delay generator with AND gates to decrease this overhead.
On the other hand, the two-phase handshake protocol does not have the return-to-zero overhead. However, the design of the controller tends to become difficult because it operates on signal transitions. It is also difficult to apply it to cyclic data paths.
We propose here a novel four-phase handshake protocol, in which inverters or buffers can be used as delay generators, leading to low overheads.
The behavior of protocols is represented in Signal Transition Graphs (STG) [4] , [8] as shown in Fig. 3 , where symbols A, B, C denote timing signals A, B, C in Fig. 2 and "+" and "−" stand for their rising and falling transitions, respectively. Rounded rectangles are delay generators inserted in forward lines, and dashed arrows indicate return-to-zero overheads.
A key aspect of our protocol is that the conventional methods shown in Fig. 3(a)-(d) use only one phase for generating the matched delay to signal the completion of the data-path stage (the period from A+ to B+ or from B+ to C+), whereas our protocol, shown in Fig. 3(e), uses three phases for that function. Moreover, to reduce overheads, all transitions can occur independently of the state of the adjacent stage, except the rising edge of the timing signal of the current stage.
The main features of our protocol in terms of request-acknowledge signaling are as follows:
• It utilizes the delay of the delay generator twice (on both rising and falling edges) to guarantee the completion of the stage operation.
• Therefore it needs only half the number of delay elements compared with conventional protocols.
• The overhead consists of only one phase (B+ → A+), which does not pass through the delay generator and thus takes little time.
Implementation of Proposed Protocol
In this section, we implement the protocol by designing the timing generator which consists of controllers and delay generators. 
Controller

Circuit Design
A design of the controller satisfying the protocol described in the previous section is shown in Fig. 5, where the modules enclosed in dashed boxes are called HSCs (handshake controllers). The timing signal generated by the HSC is applied to flip-flops. The HSC circuit is designed at the transistor level and realized as a standard cell, which is then used for synthesizing bundled-data asynchronous circuits. The HSC fires and generates the timing signal when both w req and r req go high. Note that in our protocol all transitions can occur independently of the state of adjacent stages, except that transitions w ack+ (timing signal) and r ack− occur only when both write-request and read-request signals go high. Therefore, one of the request signals may return to low and then go high again while the other remains high. To prevent the timing signal from being generated in this situation, we equip the HSC with transistors w and z controlled by two keepers, Sr and Sw, which maintain the states of the reading and writing sides, respectively.
A high state of Sr turns on the transistor w, which enables high-level request signals to affect w ack. A low state of Sr turns off the transistor w, which prevents w ack (timing signal) from going high. Functions of the keeper Sw are described similarly. When a timing signal is generated, both Sw and Sr go low. Sr (Sw) returns to high when r req (w req) goes low.
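The role of the two keepers can be illustrated with a small behavioral sketch. It abstracts away all transistor-level timing and treats the timing signal as an event; the class and method names are hypothetical, and the model assumes only the rules stated above (both keepers go low on firing, and each returns to high when its own request goes low).

```python
class HSCKeeperModel:
    """Untimed sketch of keepers Sw and Sr; not the transistor-level HSC."""

    def __init__(self):
        # Both keepers are assumed to start high (initialized state).
        self.sw = True
        self.sr = True

    def step(self, w_req: bool, r_req: bool) -> bool:
        """Apply the current request levels; return True if the HSC fires."""
        # A keeper returns to high when its own request signal is low.
        if not w_req:
            self.sw = True
        if not r_req:
            self.sr = True
        # Firing needs both requests high and both keepers high.
        fired = w_req and r_req and self.sw and self.sr
        if fired:
            # Generating the timing signal drives both keepers low.
            self.sw = False
            self.sr = False
        return fired


hsc = HSCKeeperModel()
sequence = [(True, True), (False, True), (True, True), (True, False), (True, True)]
trace = [hsc.step(w, r) for w, r in sequence]
```

In this sequence the third step does not fire even though both requests are high, because r req has not yet returned to low and Sr is still low; firing resumes only at the last step.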
For example, consider the case in which a timing signal is generated, and subsequently w req returns to low and then goes high while r req remains high throughout this period. Signal transitions relevant to this case are shown in Fig. 6. As shown in the timing chart, once w ack goes low (timing signal goes low), it does not go high again until the other request signal r req returns to low and then goes high. This is guaranteed by Sr, which keeps a low-level state after the timing signal is generated. Thus, no timing signal is generated even when one of the request signals returns to low and then goes high while the other remains high. The timing signal can be generated again only after both request signals return to low.
Restrictions imposed on delay generators using the HSC to ensure correct behavior are as follows:
1) When HSC-A in Fig. 5 fires, signals propagate through (u v w) and (x y z) inside the cell faster than through (e c2 t2 d1) or (t2 b1 e c2) outside the cell.
2) When r req of HSC-A falls, the signal propagates through (u v w) faster than through (e a2 t2 b1), and when w req of HSC-B falls, the signal propagates through (x y z) faster than through (t2 d1 e a2).
3) The pulse width of timing signal B, which is determined by the propagation time through (b1 e c2 t2), must be longer than the width of a minimum clock pulse.
Restrictions 1) and 2) are for holding the correct state in the controller. If they are not satisfied, the states of the keepers Sw and Sr change incorrectly and the handshake is not performed correctly. Restriction 3) is required for latching data into registers.
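The three restrictions amount to simple inequalities on path delays, which the following sketch checks. The path names are those used in the text for Fig. 5; the function name, the dictionary layout, and all numeric values are hypothetical and are not measurements from this work.

```python
def check_hsc_restrictions(delay, min_clock_pulse):
    """Check restrictions 1)-3); `delay` maps a path name to its delay."""
    inside_read = delay["u v w"]   # path inside the cell, reading side
    inside_write = delay["x y z"]  # path inside the cell, writing side
    return {
        # 1) Both inside paths are faster than either outside path.
        "restriction_1": max(inside_read, inside_write)
        < min(delay["e c2 t2 d1"], delay["t2 b1 e c2"]),
        # 2) Each inside path is faster than its outside counterpart
        #    when the corresponding request falls.
        "restriction_2": inside_read < delay["e a2 t2 b1"]
        and inside_write < delay["t2 d1 e a2"],
        # 3) The pulse of timing signal B exceeds the minimum clock pulse.
        "restriction_3": delay["b1 e c2 t2"] > min_clock_pulse,
    }


# Hypothetical delays in picoseconds, for illustration only.
example_delays_ps = {
    "u v w": 120, "x y z": 130,
    "e c2 t2 d1": 400, "t2 b1 e c2": 420,
    "e a2 t2 b1": 410, "t2 d1 e a2": 390,
    "b1 e c2 t2": 420,
}
report = check_hsc_restrictions(example_delays_ps, min_clock_pulse=200)
```

With these illustrative values every entry of `report` is true; making an inside path slower than an outside path flips the corresponding entries to false.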
To ensure that the protocol is sufficiently robust against delay variation, it is necessary to assign a certain amount of delay to the delay generator according to the delay model used [10] . In fact, the delay inside the cell is less than the delay outside the cell, and no stage is shorter than the minimum clock pulse. Therefore the restrictions are not at all tight in the design of the delay generator.
Because the HSCs in Fig. 5 are intended for data paths whose registers are formed by flip-flops, the pulse width of the timing signals does not have to be adjusted. When using latches for registers, we can add a circuit that creates timing signals with the required pulse width, as shown in Fig. 7, where the delay T inv of the inverter is adjusted.
Keepers Sw and Sr are initialized to high by resetting w req. The initialization is easily realized by incorporating a reset gate into the delay generator output.
HSC Family
To make the proposed protocol applicable in many situations, we need a family of HSC controllers. An example of a conceptual design of the HSC family for loops, forks, joins, branches, and merges is shown in Fig. 8. These structures are often required for controlling data paths. Figure 8 shows that the required control structures are realizable in our HSC scheme. Note that some conventional methods do not allow some of these control structures; for example, MOUSETRAP does not allow cyclic structures. Detailed comparisons among protocols in this respect are given in Sect. 4.
The design shown in Fig. 8 is conceptual in that it operates only for a small number of inputs and outputs. For a large number of inputs and outputs the circuit would need to be partitioned into smaller submodules. The detailed circuit design and analysis of overheads which will be caused by the circuit partitioning belong to future work.
Note that the sel signal, which determines the branch direction in Fig. 8(b) , must arrive earlier than a write request (w req).
Delay Generator
The delay generator is a circuit inserted to guarantee the arrival of data. For the delay generator, a wire delay (gates with long wires), a chain of gates, and a chain of gates with long channels were considered [10]. A chain of gates is robust against delay variation, and gates with long channels have low power consumption. For comparative evaluation of our protocol with others, however, we use a chain of standard cell gates as the delay generator in this study.
The delay generator needs to generate the required delay with low power consumption and a small area per unit delay. When using standard cells, the delay generator may be formed by inverters, buffers, or AND gates. Among these options, it is best to use the gates most suited to the above conditions. Figure 9 shows standard delay generators, where (a) and (b) consist of an even number of inverters and of AND gates, respectively. The rising-edge delay of circuit (a) is the sum of the gate delays, and the same holds for the falling edge, whereas circuit (b) generates only one gate delay at the falling edge. Utilizing circuit (b) greatly reduces the return-to-zero phase overhead. However, employing AND gates requires more area, wire routing, and power consumption than using inverters or buffers.
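The difference between the two delay generators of Fig. 9 can be made concrete with a short sketch. It assumes, purely for illustration, identical gates whose rising and falling delays are equal; the function names and the 50 ps gate delay are hypothetical.

```python
def inverter_chain_delays(n_gates, gate_delay_ps):
    """Circuit (a): both edges pass through every gate of the chain."""
    if n_gates % 2 != 0:
        raise ValueError("an even number of inverters keeps the polarity")
    total = n_gates * gate_delay_ps
    return total, total  # (rising-edge delay, falling-edge delay)


def and_chain_delays(n_gates, gate_delay_ps):
    """Circuit (b): the falling edge sees only one gate delay."""
    return n_gates * gate_delay_ps, gate_delay_ps


rise_a, fall_a = inverter_chain_delays(4, 50)  # (200, 200)
rise_b, fall_b = and_chain_delays(4, 50)       # (200, 50)
```

Under these assumptions both circuits give the same rising-edge delay, but only the inverter chain reproduces that delay on the falling edge, which is the property a protocol using both edges can exploit.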
Because of this difference in delay characteristics, the types of gates that can be used are often restricted by the protocols and controllers. The Fully decoupled protocol should use (b) to reduce return-to-zero overheads [4]. The GasP based protocol must use (b) to prevent rereading of old data, as shown in Fig. 6(c) of Ref. [2]. MOUSETRAP uses the rising and falling delays of the delay generator alternately in successive cycles, so these delays must be the same, leading to the use of inverter chains.
On the other hand, the HSC does not depend on the characteristics of the rising and falling edges of the delay generator. Thus the HSC can use any type of gate, which implies that the best type of gate available in a cell library can be employed.
Next we examine the cycle time for our design. The cycle time T cycle is given by the sum of the delay time T S generated for stages required to compute the data, and the return-to-zero overhead time T o :
where T HS , T D , and T ACK are delay times of the HSC, delay generator, and acknowledgment signal wire (ack wire), respectively. Suffixes "pre" and "suc" stand for sending and receiving sides, respectively, while "+" and "−" stand for rising and falling edges, respectively. The delay time of a delay generator T D is determined from the following inequality by substituting T S with Eq. (1):
where T cl , T setup , T w , and T lct are the delay of the combinational logic of the data-path stage, the setup time of the register, the write delay of the register, and the delay of the local clock-tree from the HSC to flip-flops comprising the register, respectively. The driving capacity of the controller should be high, because the timing signal should be distributed by local clock trees with short and wide paths. Delay generators for data path with delay D are shown in Fig. 10 , where (a) shows the structure used in conventional methods. In our method, the delay of the delay generator is half of the corresponding data-path stage delay as shown in Fig. 10(b) , supposing that the rising and falling delays of the delay generator are the same. Because a signal transition passes through the delay generator twice (rising and falling) as indicated by T + D and T − D in Eq. (1), it is possible to construct the delay generator with inverters or buffers. In the figure α is the delay added to ensure robustness against delay variations. Because gates that generate rising and falling delays of D/2 can be used to provide the delay for the data-path stage D, the area of the delay generator can be small, reducing its power consumption. This is particularly effective when the delay generator is dominant in area of the timing generator.
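The halving of the delay generator can be illustrated numerically. The sketch below assumes equal rising and falling delays, assumes that the margin α is added to D before halving (the exact placement of α in Fig. 10 is not reproduced here), and ignores the controller and acknowledgment-wire terms. All names and numbers are hypothetical.

```python
import math


def required_generator_delay_ps(stage_delay_ps, margin_ps, uses_both_edges):
    """Delay the generator must provide on one edge."""
    total = stage_delay_ps + margin_ps
    # A protocol that passes the generator twice needs half per edge.
    return total / 2 if uses_both_edges else total


def gates_needed(target_delay_ps, gate_delay_ps):
    """Smallest number of identical gates reaching the target delay."""
    return math.ceil(target_delay_ps / gate_delay_ps)


conventional = required_generator_delay_ps(1000, 200, uses_both_edges=False)
both_edges = required_generator_delay_ps(1000, 200, uses_both_edges=True)
n_conventional = gates_needed(conventional, 50)  # 24 gates
n_both_edges = gates_needed(both_edges, 50)      # 12 gates
```

For a hypothetical stage delay of 1000 ps with a 200 ps margin and 50 ps gates, the single-edge scheme needs 24 gates and the two-edge scheme needs 12.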
The delay of the local clock-tree of the send register T lct pre can differ from that of the receive register T lct suc . In that case the following condition must be satisfied.
where min[T cl ] stands for the shortest path of the combinational logic at the current stage of the data path. This is the condition required for avoiding overwriting the new data before the previous data has been latched to the receive register.
When the local clock-tree delay of the receiver is longer than that of the sender, Eq. (5) may not be satisfied. There are two remedies for this problem: 1) to make the levels of local clock trees equal over the data path, and 2) to insert a small delay β to adjust the delay T + ACK in an acknowledgment wire, as shown in Fig. 10. We adopt the second method because short clock trees are preferable. As a result of the insertion of β, T S changes, and the delay of the delay generator must be adjusted accordingly.
Evaluation and Discussions
We designed timing generators based on the HSC and other protocols, as shown in Fig. 11 , and measured their power consumption, cycle time T cycle , and data-path delay T S by varying the delay time T D of the delay generator. Each controller was designed to have the same driving capacity. We chose a cell with the smallest driving capacity from standard cells used to construct a delay generator.
The timing generator of the MOUSETRAP was constructed as shown in Fig. 11(a) to make equal the rising and falling delays of the delay generator. Timing generators of Fully decoupled and GasP based were constructed as shown in Fig. 11(b) to reduce the return-to-zero overhead and to ensure the correct handshaking, respectively. The construct shown in Fig. 11(c) was used for the HSC. The buffer consuming the least power among standard cell gates was used in this study.
Measurements were performed using HSPICE Ver. 2003.3 (Synopsys), under the condition of 0.18 µm process technology and a 1.8 V power supply voltage. The BSIM3v3 model (Level=49) [11] was used. We utilized standard cells created by Kyoto University [12] using the same process. The wire delay and capacitance were not included. The cycle time was measured as the time from a rising edge to the next rising edge of the timing signal B. The data-path delay was measured as the time from a rising edge of the timing signal A (B) to the rising edge of the timing signal B (C) in Fig. 11.
Results are shown in Fig. 12 and Fig. 13. Data-path delay, operation frequency, and power consumption at the maximum speed of each protocol are summarized in Table 1. Figure 12 shows the cycle time of the timing signal versus the power consumption of the whole evaluation circuits. When the cycle time is short (i.e., the corresponding data-path delay is small), the delay generator is short. The leftmost point in the figure shows the fastest case for each timing generator. For longer cycles, the delay generator tends to dominate the entire timing generator, and the power consumption of the timing generator approximates that of the delay generator. As Fig. 12 shows, the power consumption of the timing generator based on the HSC was less than that of the Fully decoupled and GasP based cases at any frequency. When the cycle time was short, the power consumption of MOUSETRAP was a little lower than that of the HSC because MOUSETRAP has a smaller controller. When the cycle time was longer than 1.2 ns, the HSC consumed less power than any other protocol. For longer cycle times, the power consumption of the HSC was half that of the Fully decoupled and GasP based cases. Thus the power consumption of our method was low at any frequency, and at relatively low frequencies in particular it yielded significant power savings compared to other protocols. Figure 13 shows the cycle time versus data-path delay for each protocol. The ideal line corresponds to the case T cycle = T S. As the figure shows, the HSC's overhead was the same as or less than that of other protocols.
The area of the delay generator also increased with an increase in the delay of the data-path stage, but the rate of the increase differed between the HSC and other protocols. The increase in the area of the delay generator in relation to the data-path delay is shown in Fig. 14. The cell-area values shown in the results are the number of tile units on which standard cells are placed. The ratio of gate areas was inverter : buffer : AND = 6 : 8 : 10. As the figure shows, the rate of increase for the HSC's delay generator was less than half that of other protocols. Because the delay of buffers and AND gates is larger than that of inverters in the standard cells used in this paper, there was a large increase in the area in the case of MOUSETRAP.
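How gate type and the halved delay target combine into area can be sketched as follows. The sketch treats the 6 : 8 : 10 ratio as tile counts per gate, which is an assumption, and the 100 ps gate delays and 1200 ps stage delay are hypothetical; the resulting numbers illustrate the trend only and are not the measured results of Fig. 14.

```python
import math

# Area ratio from the text, interpreted here as tiles per gate (assumption).
TILES_PER_GATE = {"inverter": 6, "buffer": 8, "and": 10}


def chain_area_tiles(target_delay_ps, gate, gate_delay_ps):
    """Area of a chain of identical gates that reaches the target delay."""
    n_gates = math.ceil(target_delay_ps / gate_delay_ps)
    return n_gates * TILES_PER_GATE[gate]


stage_delay_ps = 1200
# Single-edge scheme with AND gates: the full stage delay on one edge.
and_area = chain_area_tiles(stage_delay_ps, "and", gate_delay_ps=100)
# Two-edge scheme with buffers: half the stage delay on each edge.
buffer_area = chain_area_tiles(stage_delay_ps / 2, "buffer", gate_delay_ps=100)
```

With these illustrative numbers the AND chain occupies 120 tiles and the buffer chain 48 tiles, so the two-edge generator is below half the area of the single-edge one.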
Experiments were carried out using timing generators for pipelines with a small number of stages. Results are essentially the same for pipelines with a larger number of stages. Figure 15 shows the cell layout for the HSC. The VLSI layout yields an HSC cell with a size of about 5.25 NAND cells.
Based on the experimental results described above, our proposed protocol is compared with other conventional methods [2], [4], [6], [7] in Table 2, where characteristics with respect to speed, area, and power are compared at the left, and the availability of constructs for controlling various data-path structures is shown at the right. Our method is found to be good with respect to both performance and cost compared to other methods. It is also applicable to various kinds of data-path structures.
Our method utilizes both the rising and falling edges of the delay generator in the data-path delay T S, while in other methods only one of the phases can be utilized to generate the delay T S. This results in a reduction of the overhead, area, and power consumption of our method compared to others. The minimum stage delay that can be generated by our method is a little longer than that of the GasP based method. This is because three phases are assigned in the HSC for creating T S, and slightly more transistors are used in the HSC than in the GasP based controller.
Reference [13] is related to our work in that both rising and falling edges are utilized to compose the delay generator, which allows the use of inverters or buffers. However, handshaking between adjacent stages is not considered in that study. The examples of data paths described in [13] require control structures represented by STGs containing only one token, which leads to very simple control structures. On the other hand, our protocol is proposed for controlling various types of data paths requiring handshaking between adjacent stages. Furthermore, the evaluation in [13] is performed only at gate level, whereas our evaluation is based on a transistor-level design.
Conclusions
In this paper we proposed and implemented a handshake protocol for bundled-data asynchronous circuits with consideration given to power consumption and area. A key aspect was that our protocol used three phases for generating the delay to signal the completion of the data-path stage whereas conventional methods used only one phase. Moreover, by accepting the falling edge of write and read requests independently, as well as preventing overwriting new data or rereading old data in the controller, the method enables power consumption and area of the delay generator to be reduced while maintaining high performance.
Comparisons with other protocols showed that our protocol realized lower power consumption than any other protocol at cycle times of 1.2 ns or more. The area of the delay generator required for a given data-path delay was less than half that of other protocols. Moreover, the overhead of the timing generator for this protocol was the same as or less than that of other protocols.
