This paper presents a novel architecture of an asynchronous FPGA for handshake-component-based design. Handshake-component-based design is suitable for large-scale, complex asynchronous circuits because it is easy to understand. This paper proposes an area-efficient FPGA architecture suited to handshake-component-based asynchronous circuits. Moreover, four-phase dual-rail encoding is employed to construct circuits that are robust to delay variation, because the data paths in an FPGA are programmable. An FPGA based on the proposed architecture is implemented in a 65 nm process. The evaluation results show that the proposed FPGA can implement handshake components efficiently.
Introduction
Recent technology scaling enables designs with billions of transistors. However, the increased complexity of circuits leads to two problems. The first is cost. Rising process development costs have increased the fabrication cost of chips, and design and verification costs have also become serious problems. The second is performance. Currently, most digital circuits are synchronous circuits that operate on clock signals. As the number of transistors integrated on a chip has increased, the clock distribution network has become complex and its power consumption has grown. In addition, increasing the clock frequency has become a severe challenge because the clock signal must be distributed over the entire chip.
To solve the first problem, field-programmable gate arrays (FPGAs) are widely used to implement special-purpose processors. Since users can program the logic functions and interconnections of FPGAs directly, it is easy to develop special-purpose processors. In addition, FPGAs are cost-effective because they are produced in large quantities.
To solve the second problem, asynchronous circuits are attracting attention. In an asynchronous circuit, data transfer is done by handshaking using a request signal and an acknowledge signal. Since no clock signal is necessary, the problems caused by the clock distribution network do not arise. However, asynchronous circuits are difficult to design.
Handshake-component-based design [1] has been proposed as a design method for asynchronous circuits. In handshake-component-based design, asynchronous circuits are designed by connecting handshake components. Since various handshake components are defined, for example for data processing and data path control, it is easy to design an asynchronous data path and its controller. Therefore, handshake-component-based design is suitable for applications that contain complex data processing. In addition, Balsa [2] has been proposed as a design methodology that uses handshake components. Balsa is a hardware description language that frees circuit designers from low-level details such as handshake control. Moreover, there are synthesis tools that generate, from Balsa descriptions, handshake circuits consisting of handshake components, as well as standard-cell netlists. Using Balsa, circuit designers can easily implement complex large-scale circuits such as a DMA controller [2] and a microprocessor [3]. Thus, handshake-component-based design is suitable for complex large-scale asynchronous circuits.
To solve the cost and performance problems, several asynchronous FPGAs have been proposed [4]-[10]. The asynchronous FPGAs developed by Cornell University [4], [5], Achronix [6] and the University of Tokyo [7] employ fine-grained pipelined architectures to achieve high throughput. References [8]-[10] propose asynchronous FPGA architectures focusing on low power consumption. The asynchronous FPGAs proposed in [8], [10] combine two handshake protocols to reduce the energy consumed by data operations and transmissions. Reference [9] proposes an autonomous power-gating scheme based on a handshake protocol. However, conventional asynchronous FPGAs cannot implement handshake components efficiently, since their architectures support only simple handshake sequences specialized for simple data processing and transfer. Therefore, it is difficult to design control-intensive applications on conventional asynchronous FPGAs.
In this paper, we propose an FPGA architecture that is suitable for handshake-component-based asynchronous circuits. The proposed architecture efficiently implements the handshake components defined in Balsa. Therefore, the proposed FPGA is suitable for implementing complex applications. Small, frequently used handshake components are implemented on a single Logic Block (LB), and other handshake components are implemented using more than one LB. As handshake components can be mapped directly onto the proposed architecture, circuit designers can utilize existing CAD tools that generate a netlist of handshake components. Therefore, a design method for the proposed FPGA is established.
Copyright © 2013 The Institute of Electronics, Information and Communication Engineers
Handshake-Component-Based Asynchronous Circuit Design

Handshake Component
In an asynchronous circuit, synchronization between circuits is done by handshaking with a request signal and an acknowledge signal. Figure 1 shows a four-phase handshake sequence. First, the active port sets the request wire to "1" as shown in Fig. 1 (a). Second, the passive port sets the acknowledge wire to "1" as shown in Fig. 1 (b). Third, the active port sets the request wire to "0" as shown in Fig. 1 (c). Finally, the passive port sets the acknowledge wire to "0" as shown in Fig. 1 (d).
To design asynchronous circuits, various design methodologies have been proposed. Petrify [11] is an asynchronous circuit synthesis tool that uses a Signal Transition Graph (STG) [12]. An STG describes the transition sequences of wires and is therefore suitable for describing control circuits. However, it is difficult to use an STG to design circuits that contain many wires. Another design methodology uses asynchronous circuit elements called handshake components. Asynchronous circuits are constructed by connecting handshake components. Handshake components were created for ...
Figure 2 shows handshake components. Handshake components constitute a handshake circuit. Each handshake component has ports and is connected to another handshake component through a channel. Communication between handshake components is done by sending a request signal from the "active" port and an acknowledge signal from the "passive" port. Depending on the kind of handshake component, data signals are sent along with request signals or acknowledge signals. The number of ports of a handshake component and the width of its data signals can vary. There are 46 handshake components [13], and each handshake component is used for data processing or data path control.
Figure 3 shows a Sequence component. A Sequence component has an activate port and N activateOut ports. It starts handshaking sequentially from activateOut0 to activateOutN − 1, so that the handshake component connected to each activateOut port is activated in turn. In this manner, the Sequence component controls the order of processing. Figure 4 shows the signal transitions of a Sequence component. Arrows denote dependencies between signal transitions. The behavior of a Sequence component with two activateOut ports is as follows:
1. activate.req is set to "1"
2. activateOut0.req is set to "1"
3. activateOut0.ack is set to "1"
4. activateOut0.req is set to "0"
5. activateOut0.ack is set to "0"
6. activateOut1.req is set to "1"
7. activateOut1.ack is set to "1"
8. activate.ack is set to "1"
9. activate.req is set to "0"
10. activateOut1.req is set to "0"
11. activateOut1.ack is set to "0"
12. activate.ack is set to "0"
As seen above, handshake components execute complex handshake sequences. However, handshake circuits are easy to understand and manage because the function of each handshake component is clear and each handshake is symbolized by a channel and ports. Asynchronous circuits with complex process control are designed using the handshake components shown in Fig. 5. Figure 6 shows an example of a handshake circuit.
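The twelve-step sequence above can be reproduced with a short Python sketch. The function name `sequence_trace` is hypothetical, the model records only the order of transitions (no timing), and the generalization to more than two activateOut ports is an assumption: every port except the last is taken to complete a full four-phase handshake like activateOut0, while the last port behaves like activateOut1.

```python
def sequence_trace(n_outputs):
    """Return the ordered (signal, value) transitions of a Sequence component.

    Assumption: the component connected to each activateOut port
    acknowledges immediately, so only the ordering is modelled.
    """
    if n_outputs < 1:
        raise ValueError("a Sequence component needs at least one activateOut port")
    trace = [("activate.req", 1)]
    last = n_outputs - 1
    for i in range(n_outputs):
        port = f"activateOut{i}"
        trace.append((f"{port}.req", 1))
        trace.append((f"{port}.ack", 1))
        if i == last:
            # Last port: its return-to-zero phase is interleaved with
            # the handshake on the activate port (steps 8 to 12).
            trace.append(("activate.ack", 1))
            trace.append(("activate.req", 0))
            trace.append((f"{port}.req", 0))
            trace.append((f"{port}.ack", 0))
            trace.append(("activate.ack", 0))
        else:
            # Earlier ports: complete four-phase handshake (steps 2 to 5).
            trace.append((f"{port}.req", 0))
            trace.append((f"{port}.ack", 0))
    return trace


if __name__ == "__main__":
    for step, (signal, value) in enumerate(sequence_trace(2), start=1):
        print(f'{step}. {signal} is set to "{value}"')
```

Calling `sequence_trace(2)` yields the twelve transitions in the order listed in the text.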
Also, there are tools that translate a high-level circuit description into a handshake circuit in order to synthesize an asynchronous circuit. Thus, handshake-component-based design is suitable for complex and large-scale asynchronous circuits.
Implementation of Handshake Components
Circuit synthesis is done by replacing each handshake component with the corresponding asynchronous circuit. Therefore, implementations in different technologies are obtained by providing circuit libraries. In asynchronous circuits, hazards are a serious problem [14]. A hazard is an unwanted glitch on a signal, and it can cause a malfunction. To guarantee correct operation of the implemented application, the asynchronous circuits that correspond to handshake components should be hazard-free.
In handshake-component-based design, implementations in different asynchronous data encodings are obtained by changing circuit libraries. Asynchronous data encoding schemes are mainly classified into:
• Single-rail encoding (e.g., bundled-data encoding)
• Dual-rail encoding (e.g., four-phase dual-rail encoding)
Bundled-data encoding is the most common single-rail encoding. Figure 7 shows a bundled-data channel. The value is encoded as in a synchronous circuit, using N wires to denote an N-bit number, and the control signals are carried on dedicated wires denoted by REQ and ACK. Therefore, a channel that carries N-bit data consists of N + 2 wires. Bundled-data encoding requires the explicit insertion of matching delays in the control signal oriented in the same direction as the data signal, because the control signal must never be received before the bundled value is valid. In an FPGA, since the data path is programmable, complex programmable delay elements would be required. As a result, bundled-data encoding is not suitable for FPGAs.
Four-phase dual-rail (FPDR) encoding is the most common dual-rail encoding. Figure 8 shows an FPDR channel. FPDR encoding encodes a data bit, together with the control signal oriented in the same direction as the data signal, onto two wires. Table 1 shows the code table of FPDR encoding. The data value "0" is encoded as (0, 1) and "1" is encoded as (1, 0). The spacer is encoded as (0, 0). Figure 9 shows an example in which the data values "0", "0" and "1" are transferred. The main feature is that the sender sends a spacer after each data value. The receiver detects the arrival of a data value by detecting a change from "0" to "1" on either wire. In FPDR encoding, the control signal is carried implicitly by the encoded value, and no delay insertion is therefore required [14]. Hence, FPDR encoding is robust to delay variations and is well suited to FPGAs, in which the data path is programmable. Although 2N + 1 wires are required to transfer an N-bit value, FPDR encoding is employed in the proposed architecture for this reason.
Architecture
Overall Architecture
Figure 10 shows the overall architecture of the proposed FPGA, and Fig. 11 shows the programmable interconnection resources (Connection Blocks and Switch Blocks) around an LB. The FPGA consists of mesh-connected cells like conventional FPGAs. As shown in Fig. 10, each cell includes an LB, two Connection Blocks (CBs) and a Switch Block (SB). The upper CB connects SBs to the N1, N2 and S terminals of two LBs, and the bottom CB connects SBs to the E1, E2 and W terminals. The proposed architecture can implement 39 out of the 46 handshake components defined in the Balsa manual [13]. Handshake components that have multiple ports or a wide data path can be implemented using several LBs. As mentioned in Sect. 2.2, FPDR encoding is employed for asynchronous data encoding. As a result, three wires are required for each data bit: two wires carry the data bit in FPDR encoding, and one wire carries a request or an acknowledge signal. The proposed FPGA is based on the Quasi-Delay-Insensitive (QDI) model, which assumes that gate delays and wire delays are unknown and that signal transitions occur at the same time at all end-points of wire forks [14], [15].
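The FPDR code employed by the architecture (Table 1) and the wire counts quoted in Sect. 2.2 can be sketched in Python as follows. All function names are hypothetical. Treating the pair (1, 1) as invalid is the usual convention for dual-rail codes; the text itself only lists the three code words.

```python
SPACER = (0, 0)


def fpdr_encode_bit(bit):
    """Encode one bit as in the code table: "0" -> (0, 1), "1" -> (1, 0)."""
    if bit not in (0, 1):
        raise ValueError("bit must be 0 or 1")
    return (1, 0) if bit else (0, 1)


def fpdr_decode_pair(pair):
    """Return 0 or 1 for a valid code word, or None for the spacer."""
    if pair == SPACER:
        return None
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    # Assumption: (1, 1) is not a code word and is rejected.
    raise ValueError("(1, 1) is not a valid FPDR code word")


def fpdr_stream(bits):
    """Code words sent for a sequence of bits; a spacer follows every value."""
    stream = []
    for bit in bits:
        stream.append(fpdr_encode_bit(bit))
        stream.append(SPACER)
    return stream


def wires_bundled(n_bits):
    """N data wires plus REQ and ACK."""
    return n_bits + 2


def wires_fpdr(n_bits):
    """Two rails per bit plus one control wire in the opposite direction."""
    return 2 * n_bits + 1


if __name__ == "__main__":
    # The transfer of "0", "0", "1" described for Fig. 9.
    print(fpdr_stream([0, 0, 1]))
    print(wires_bundled(4), wires_fpdr(4))
```

A receiver in this model detects a new value whenever `fpdr_decode_pair` changes from `None` to 0 or 1, which corresponds to one of the two rails rising.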
As shown in Fig. 11, an SB consists of diamond switches and Req/Ack modules. Diamond switches allow a data signal on a track to connect to other tracks. Figure 12 shows the structure of the Req/Ack module. The Req/Ack module consists of switches, an OR gate and a Muller C-element [14]. It allows a control signal on a track to connect to other tracks. In addition, two control signals can be merged using the C-element or the OR gate. The LB accesses nearby communication resources through CBs, which connect the input and output terminals of the LB to SBs through programmable switches.
Figure 13 shows an LB of the proposed architecture. The proposed FPGA architecture can implement 39 handshake components. The LB consists of a BinaryFunction module, a Variable module, a Sequence module, a CallMUX module, a Case module, and an Encode module. An Input switch module and an Output switch module connect these modules to CBs. As mentioned in the previous section, circuit synthesis is done by replacing each handshake component with the corresponding asynchronous circuit. Thus, asynchronous circuits can be implemented on a conventional FPGA by replacing each handshake component with a combination of LUTs. As mentioned in Sect. 2.2, asynchronous circuits that implement handshake components should be hazard-free. However, it is difficult to implement hazard-free asynchronous circuits using LUTs because the delays of SBs and CBs affect circuit operation. Therefore, in the proposed architecture, each LB includes dedicated circuits for implementing handshake components.
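The two ways of merging control signals in the Req/Ack module behave differently, which a small behavioral model makes concrete. The Python sketch below uses the standard definition of a two-input Muller C-element (the output follows the inputs when they agree and holds its previous value otherwise). The class and function names are hypothetical, and the model ignores the programmable switches and all timing.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        # The output changes only when both inputs agree;
        # otherwise the previous value is held.
        if a == b:
            self.out = a
        return self.out


def or_gate(a, b):
    """Plain OR merge: the output rises as soon as either input rises."""
    return a | b


if __name__ == "__main__":
    c = CElement()
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "C-element:", c.update(a, b), "OR:", or_gate(a, b))
```

In this model the C-element waits for both control signals before its output changes, whereas the OR gate passes on whichever signal arrives first.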
BinaryFunction Module Structure
In handshake-component-based design, logical and arithmetic operations are denoted by BinaryFunction components, as shown in Fig. 14. As mentioned in Sect. 3.1, the proposed architecture employs FPDR encoding, which encodes a bit and a control signal onto two wires. Therefore, the acknowledge signals of the BinaryFunctionIn0, BinaryFunctionIn1 and BinaryFunctionOut ports are sent along with the data signals. In the proposed architecture, a BinaryFunction module is used to implement a BinaryFunction component. Figure 15 shows the structure of a BinaryFunction module. The module consists of an FPDR 4-input LUT and logic gates that detect the arrival of valid data and spacers. When valid signals arrive at LUT In, Data valid becomes "1" and the LUT starts to operate. The result of the LUT is stored in the Variable module. Then, LUT ready is set to "1" and the LUT stops its operation. Figure 16 shows the structure of the LUT. For simplicity, a 2-input LUT is shown instead of the 4-input LUT used in the actual LB. The LUT is implemented based on [4] and [14]. A BinaryFunction module can implement a BinaryFunction component with two 2-bit inputs, or a BinaryFunction component with a 1-bit input and a 3-bit input. A complex BinaryFunction component can be implemented by combining BinaryFunction modules.
Variable Module Structure
Sequence Module Structure
... Fig. 26 correspond to SequenceActivate.req, Sequence0.req, Sequence0.ack and Sequence1.req in Fig. 23. Since activate.ack and activateOut1.ack are connected as shown in Fig. 28 (a), there are no dedicated wires for these signals in a Sequence module. Figure 29 shows the signal transitions of a Sequence module operating as a Sequence component. The behavior as a Sequence component is as follows:
1. SequenceActivate.req is set to "1"
2. Sequence0.req is set to "1"
3. Sequence0.ack is set to "1"
4. Sequence0.req is set to "0"
5. Sequence0.ack is set to "0"
6. Sequence1.req is set to "1"
7. SequenceActivate.req is set to "0"
8. Sequence1.req is set to "0"
... Fig. 23. Figure 30 shows the signal transitions of a Sequence module operating as a Concur component. The behavior as a Concur component is as follows:
1. SequenceActivate.req is set to "1"
2. Sequence0.req and Concur1.req are set to "1"
3. The following transitions occur:
• Sequence0.ack is set to "1" following the rise of Sequence0.req
• Concur1.ack is set to "1" following the rise of Concur1.req
4. The following transitions occur:
• FalseVariable.ack is set to "1" following the rise of Sequence0.ack and Concur1.ack
• Sequence0.req is set to "0" following the rise of Sequence0.ack
• Concur1.req is set to "0" following the rise of Concur1.ack
5. The following transitions occur:
• SequenceActivate.req is set to "0" following the rise of FalseVariable.ack
• Sequence0.ack is set to "0" following the fall of Sequence0.req
• Concur1.ack is set to "0" following the fall of Concur1.req
6. FalseVariable.ack is set to "0"
A Sequence module is also used to implement the Loop component and the While component.
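Because the Concur behavior is a partial order rather than a single sequence, it can be helpful to write the "following" relations down as a dependency table and check candidate orderings against it. The Python sketch below does this. The names `CONCUR_DEPS` and `respects_dependencies` are hypothetical, "+" denotes a rise to "1" and "-" a fall to "0". Two points are assumptions: step 2 is taken to follow step 1 because of the numbering, and step 6 is taken to wait for every transition of step 5, since the text does not state its predecessors.

```python
# Each transition maps to the set of transitions that must precede it.
CONCUR_DEPS = {
    "SequenceActivate.req+": set(),
    "Sequence0.req+": {"SequenceActivate.req+"},
    "Concur1.req+": {"SequenceActivate.req+"},
    "Sequence0.ack+": {"Sequence0.req+"},
    "Concur1.ack+": {"Concur1.req+"},
    "FalseVariable.ack+": {"Sequence0.ack+", "Concur1.ack+"},
    "Sequence0.req-": {"Sequence0.ack+"},
    "Concur1.req-": {"Concur1.ack+"},
    "SequenceActivate.req-": {"FalseVariable.ack+"},
    "Sequence0.ack-": {"Sequence0.req-"},
    "Concur1.ack-": {"Concur1.req-"},
    # Assumption: step 6 waits for all transitions of step 5.
    "FalseVariable.ack-": {
        "SequenceActivate.req-",
        "Sequence0.ack-",
        "Concur1.ack-",
    },
}


def respects_dependencies(order, deps=CONCUR_DEPS):
    """Check that `order` lists every transition once and after its predecessors."""
    position = {t: i for i, t in enumerate(order)}
    if len(order) != len(deps) or set(position) != set(deps):
        return False
    return all(
        position[pred] < position[t]
        for t, preds in deps.items()
        for pred in preds
    )


if __name__ == "__main__":
    in_step_order = list(CONCUR_DEPS)  # dicts keep insertion order
    print(respects_dependencies(in_step_order))
```

Under these assumptions, an ordering in which the Concur1 handshake finishes before Sequence0.req even rises is also accepted, which reflects that the two branches proceed independently.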
CallMUX Module Structure
A CallMUX module implements the CallMUX component shown in Fig. 31. The CallMUX component is used to merge several input channels into one output channel. Figure 32 shows the structure of a CallMUX module. A CallMUX module implements four input ports. Each input and output port can transfer 1-bit data. Figure 33 shows the signal transitions of a CallMUX module operating as a CallMUX component. The behavior when data arrive at the CallMUXIn0 port is as follows:
1. A valid data value arrives at the CallMUXIn0 port
2. CallMUXOut outputs the value that CallMUXIn0 received ...
Case Module Structure
A Case module implements the Case component shown in Fig. 34. The Case component selects one of the CaseOut ports according to the value received at the CaseIn port, and starts handshaking on that port. Figure 35 shows the structure of a Case module. A Case module implements four CaseOut ports. Figure 36 shows the signal transitions of a Case module operating as a Case component. The behavior when data "0" arrives at the CaseIn port is as follows:
Encode Module Structure
An Encode module implements the Encode component shown in Fig. 37. When a handshake through the EncodeIn k port starts, EncodeOut outputs the data value "k". Figure 38 shows the structure of an Encode module. An Encode module implements four EncodeIn ports. Figure 39 shows the signal transitions of an Encode module. The behavior when a handshake through the EncodeIn0 port starts is as follows:
1. EncodeIn0.req is set to "1"
2. EncodeOut outputs the data "0"
3. EncodeOut.ack is set to "1"
4. EncodeIn0.ack is set to "1"
5. EncodeIn0.req is set to "0"
6. EncodeOut outputs spacers
7. EncodeOut.ack is set to "0"
8. EncodeIn0.ack is set to "0"
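The eight steps above apply to any EncodeIn port, with only the output value changing. The following Python sketch generates the trace for port k together with the FPDR code words placed on EncodeOut. The names `fpdr_word` and `encode_trace` are hypothetical. The output width (2 bits for four ports) and the bit order (least significant bit first) are assumptions, since the text does not state them.

```python
def fpdr_word(value, width):
    """FPDR rail pairs for `value`, least significant bit first (assumed order).

    Uses the code table of Sect. 2.2: "0" -> (0, 1), "1" -> (1, 0).
    """
    return [(1, 0) if (value >> i) & 1 else (0, 1) for i in range(width)]


def encode_trace(k, n_ports=4):
    """Ordered (signal, value) events when a handshake starts on EncodeIn k."""
    if not 0 <= k < n_ports:
        raise ValueError("no such EncodeIn port")
    # Assumption: EncodeOut is just wide enough to hold the port index.
    width = max(1, (n_ports - 1).bit_length())
    spacer = [(0, 0)] * width
    port = f"EncodeIn{k}"
    return [
        (f"{port}.req", 1),
        ("EncodeOut.data", fpdr_word(k, width)),  # step 2: data "k"
        ("EncodeOut.ack", 1),
        (f"{port}.ack", 1),
        (f"{port}.req", 0),
        ("EncodeOut.data", spacer),               # step 6: spacers
        ("EncodeOut.ack", 0),
        (f"{port}.ack", 0),
    ]


if __name__ == "__main__":
    for step, event in enumerate(encode_trace(0), start=1):
        print(step, event)
```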
As shown in Table 2, each module implements several handshake components. Because modules are shared among handshake components in this way, the transistor count of the proposed FPGA is kept small.
Implementation of Complex Handshake Components
In the proposed architecture, each LB contains modules to implement handshake components. However, to keep the structure of an LB simple, the handshake components that can be implemented using a single LB are limited. Therefore, in the proposed architecture, frequently used simple handshake components are implemented using one LB, and rarely used large-scale handshake components are implemented using multiple LBs and programmable interconnections. As an example of a complex handshake component, an implementation of a Variable component that stores Width-bit data is shown below. In general, a Variable component has a passive port that receives Width-bit data and N passive ports that output Width-bit data, as shown in Fig. 17. In the proposed architecture, an LB contains a Variable module that stores 2-bit data. A Variable module has one 2-bit input port and two 2-bit output ports. Therefore, the number of LBs required to implement a Variable component with N Width-bit output ports depends on both Width and N, as shown in Fig. 40.
Table 3 Transistor count of a cell and its breakdown.
Evaluation
The proposed FPGA is implemented in the e-Shuttle 65 nm CMOS process with a 1.2 V supply. The circuits are evaluated by pre-layout simulation with HSPICE. Therefore, the parasitic capacitance and resistance of the programmable interconnection resources are not considered in the evaluation results. For comparison, a conventional asynchronous FPGA architecture is also implemented. Figure 41 shows the LB structure of the conventional FPDR FPGA architecture. The LB of the conventional FPGA mainly consists of an LUT, an asynchronous register, an FPDR multiplexer and an FPDR demultiplexer [14]. In the conventional asynchronous FPGA, applications are designed by combining seven building blocks [5]. Table 3 compares the cells of the proposed architecture and the conventional architecture. Since the proposed architecture contains modules for handshake components, the transistor count of a cell is 62% higher than that of the conventional architecture.
The next evaluation shows the implementation results of a 4-bit counter and a 4-bit counter with a conditional branch. Figure 42 shows the equivalent synchronous circuits of the test applications. Table 4 compares cell counts and transistor counts. The benchmark circuits consist of cells, and each cell includes an LB, an SB and two CBs as shown in Fig. 10. In the case of the 4-bit counter, the number of cells is reduced by 21%, but the transistor count is increased by 27% compared to the conventional architecture, as shown in Table 4 (a). On the other hand, as shown in Table 4 (b), the numbers of cells and transistors are reduced by 45% and 11%, respectively, in the case of the 4-bit counter with a conditional branch. This is because handshake-component-based design can efficiently implement applications that include data path control such as conditional branches. Table 5 compares the energy consumption per count-up operation. Compared to the conventional architecture, the energy consumption is reduced by 9% and 27%, respectively. The results show that the proposed architecture is suitable for applications with complex sequence control. Table 6 compares throughputs. Throughput is defined as the number of operations per second. Compared to the conventional architecture, throughput is decreased by 51% and 41%, respectively. This is because handshake components execute complex handshake sequences.
Conclusions
This paper presented an architecture of an asynchronous FPGA for handshake-component-based design. The proposed FPGA architecture implements handshake components efficiently. Thus, the proposed architecture is suitable for synthesis tools that generate netlists consisting of handshake components, such as Balsa. In addition, handshake-component-based design is suitable for applications that have complex data path control. Therefore, the proposed architecture is suitable for implementing complex large-scale asynchronous circuits.
As future work, a hybrid architecture combining the conventional asynchronous FPGA and the proposed asynchronous FPGA can be considered. The conventional asynchronous architecture is simple and can achieve high throughput. On the other hand, the proposed architecture is suitable for applications that have complex data path control. Therefore, by employing the conventional architecture in the data path and the proposed architecture in the sequence controller, a low-power, small-area and high-throughput implementation could be achieved.
