This paper proposes a behavioral synthesis system for asynchronous circuits with bundled-data implementation. The proposed system is based on a behavioral synthesis method for synchronous circuits and extended on operation scheduling and control synthesis for bundled-data implementation. The proposed system synthesizes an RTL model and a simulation model from a behavioral description specified by a restricted C language, a resource library, and a set of design constraints. This paper shows the effectiveness of the proposed system in terms of area and latency through comparisons among bundled-data implementations synthesized by the proposed system, synchronous counterparts, and bundled-data implementations synthesized by using a behavioral synthesis method for synchronous circuits directly.
Introduction
Asynchronous circuits have several advantages such as average-case performance, low power consumption, and so on. However, the design of asynchronous circuits is difficult because designers must decide on a proper delay model, data encoding scheme, and control protocol according to a given application. Moreover, a hazard-free circuit must be realized because the propagation of a hazard results in circuit malfunction. Nevertheless, only a limited number of CAD tools are available.
Behavioral synthesis synthesizes an RTL model from a behavioral description specified by a programming language or its extension (e.g., C or SystemC 1)). A behavioral synthesis method generates an optimum RTL model through operation scheduling, resource allocation, and control synthesis while exploring the design space under design constraints. In the domain of synchronous circuits, many behavioral synthesis methods and their support tools have been developed because of the requirement for system-level design 2), 3). This paper presents a behavioral synthesis system for asynchronous circuits with bundled-data implementation. The proposed system is based on a behavioral synthesis method for synchronous circuits. Operation scheduling and control synthesis are extended for bundled-data implementation because the execution of bundled-data implementation is different from synchronous circuits.
†1 The University of Aizu, Japan †2 National Institute of Informatics, Japan †3 The University of Utah, USA †4 The University of Tokyo, Japan
The proposed system synthesizes an RTL model from a behavioral description specified by a restricted C language, a resource library, and a set of design constraints. This paper presents the effectiveness of the proposed system in terms of area and latency through comparisons among bundled-data implementations synthesized by the proposed system, synchronous counterparts, and bundled-data implementations synthesized by using a behavioral synthesis method for synchronous circuits directly.
The rest of this paper is organized as follows. Section 2 gives related work. Section 3 gives background used in this paper. Sections 4 and 5 give the proposed system and its evaluation. Finally, Section 6 gives conclusions.
Related Work
There exist many behavioral synthesis methods for synchronous circuits 2), 3). These methods schedule operations to control steps, each of which corresponds to a clock cycle. However, the direct application of these methods to asynchronous circuits may not synthesize optimum circuits. The reason is as follows. Suppose we apply a behavioral synthesis method for synchronous circuits to asynchronous circuits. In such a case, all operations are scheduled to a clock cycle. This ignores a characteristic of asynchronous circuits in which operations are executed immediately after the completion of previous operations. It may result in a performance loss or an extra use of resources in synthesized circuits. Even if we adjust operations so that each control step has an independent time interval using asynchronous control circuits, the execution order of operations does not change. This means that the performance loss or the extra use of resources is not essentially solved just by changing the control scheme. This motivates us to develop a scheduling method dedicated to asynchronous circuits.
Several behavioral synthesis methods for asynchronous circuits have also been proposed. The method described in Ref. 4) is based on a heuristic list scheduling algorithm 2) which determines the start time of operations under resource constraints observing the availability of resources and the completion of direct predecessor operations. Compared to this method, the proposed system approximates a set of start time candidates for each operation and determines the start time of operations from the candidates considering design optimization. The method described in Ref. 5) explores resource sharing between operations by introducing additional dependence. However, it has a huge computational complexity (i.e., $O(3^{n(n-1)/2})$, where $n$ is the number of operations).
On the other hand, the methods described in Refs. 6), 7) do not propose a new scheduling and/or allocation algorithm. Instead, these methods use templates for the control of registers derived from a synchronous or asynchronous behavioral synthesis method. Different from these methods, the proposed system extends operation scheduling and control synthesis.
The methods proposed by Venkataramani, et al. 8) and Bardsley and Edwards 9) generate asynchronous circuits from a high-level language such as C language or communicating sequential processes (CSP) 10) . Different from the proposed system where the design space is explored to generate an optimum circuit, these methods generate a circuit by a direct translation from a given specification without design space exploration.
We proposed a behavioral synthesis method for bundled-data implementation in our former work 11) . This paper extends our previous method so that control constructs such as branches and loops in a given description can be synthesized.
Background

Control Data Flow Graph
The Control Data Flow Graph (CDFG) is a directed graph which represents the data and control flow of an application. The CDFG is used as an intermediate representation in the proposed system. The CDFG $G$ is defined as follows.

$$G = \langle N, BB, E \rangle$$

$N$, $BB$, and $E$ are sets of nodes, basic blocks, and directed edges, respectively.
The node set $N$ ($N = \{n_i \mid i = 1, \ldots, \gamma\}$) consists of operation nodes, variable nodes, fork nodes, join nodes, the source node, and the sink node. $\gamma$ represents the number of nodes. An operation node represents a data operation labeled by an operation type (e.g., addition), a variable node represents a variable (a primary input, a primary output, or the result of an operation), a fork node represents a branch, and a join node represents a merge of branched control flows. The source and sink nodes represent the start and end of the application, respectively.
A given behavioral description can be partitioned into a set of basic blocks $BB$ ($BB = \{bb_k \mid k = 1, \ldots, \delta\}$). $\delta$ represents the number of basic blocks in a CDFG. A basic block $bb_k$ is a sequence of consecutive statements in which control flow enters at the beginning and leaves at the end without halt or branch except at the end.
The edge set $E$ consists of directed edges $e_{i,j}$ which represent dependencies between nodes $n_i$ and $n_j$. If nodes $n_i$ and $n_j$ are an operation node and a variable node, the edge $e_{i,j}$ represents a data dependency. If either node $n_i$ or $n_j$ is a fork node, a join node, the source node, or the sink node, the edge $e_{i,j}$ represents a control dependency. Figure 1 shows a CDFG. In Fig. 1, the rectangle nodes, circle nodes, triangle node, and inverted triangle node are operation nodes, variable nodes, a fork node, and a join node, respectively. For convenience, this paper denotes an operation node, a variable node, a set of operation nodes, and a set of variable nodes as $o_i$, $v_i$, $Op$, and $V$, respectively.
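As an illustration of the definition above, the following minimal Python sketch stores a CDFG as node kinds, basic-block membership, and directed edges, and classifies an edge as a data or control dependency. The class name `CDFG`, the enum `NodeKind`, and the method names are hypothetical and are not taken from the proposed system (which is implemented in Java); the sketch only mirrors the rule stated in the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    OPERATION = "operation"
    VARIABLE = "variable"
    FORK = "fork"
    JOIN = "join"
    SOURCE = "source"
    SINK = "sink"


@dataclass
class CDFG:
    """G = <N, BB, E>: nodes, basic blocks, and directed edges (hypothetical layout)."""
    kinds: dict = field(default_factory=dict)   # node id -> NodeKind
    blocks: dict = field(default_factory=dict)  # node id -> basic block index (or None)
    edges: set = field(default_factory=set)     # (i, j) pairs standing for e_{i,j}

    def add_node(self, node_id, kind, block=None):
        self.kinds[node_id] = kind
        self.blocks[node_id] = block

    def add_edge(self, i, j):
        self.edges.add((i, j))

    def edge_type(self, i, j):
        """Control dependency if either end is a fork, join, source, or sink node."""
        control = {NodeKind.FORK, NodeKind.JOIN, NodeKind.SOURCE, NodeKind.SINK}
        if self.kinds[i] in control or self.kinds[j] in control:
            return "control"
        return "data"


# Tiny example: source -> variable a -> addition
g = CDFG()
g.add_node("src", NodeKind.SOURCE)
g.add_node("a", NodeKind.VARIABLE, block=1)
g.add_node("add1", NodeKind.OPERATION, block=1)
g.add_edge("src", "a")
g.add_edge("a", "add1")
```

In this example the edge from the source node to `a` is a control dependency, and the edge from `a` to `add1` is a data dependency.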
Bundled-data Implementation
Bundled-data implementation is one of the data encoding schemes for asynchronous circuits. In bundled-data implementation, an N-bit data transfer is represented by N + 2 signal wires. Each data bit is represented by one signal wire. The remaining two wires carry the handshake signals, a request signal req and an acknowledge signal ack. Operations during a data transfer are initiated by the request signal req, while the completion of operations is acknowledged by the acknowledge signal ack. To guarantee the completion of operations, a delay element is put on the req signal wire. The delay of a delay element is the sum of the maximum execution delays of the resources used to execute operations.
Figure 2 shows a circuit model of bundled-data implementation used in this paper. The model consists of a data-path circuit and a control circuit. The data-path circuit consists of functional units which execute an operation, registers which store input data or an operation result, and multiplexers which select an appropriate input for a functional unit or register. The control circuit consists of Q-modules 12), glue logics, and delay elements which guarantee the control timing to write data into registers. In the proposed system, a Q-module is mapped to each state $s_h$ ($h = 1, \ldots, \beta$), which is determined by the operation scheduling result. $\beta$ represents the number of states. The execution time of each state is equal to the delay of the delay element on the corresponding request signal.
The control of bundled-data implementation is explained as follows. Q-module $q_h$ for state $s_h$ is activated by a rising edge of input signal $in_h$, which comes from the previous Q-module $q_{h-1}$. When an operation is executed at a shared functional unit or the result of an operation is written into a shared register in state $s_h$, multiplexers for such shared resources are controlled by $in_h$ of Q-module $q_h$. As resources are shared at different states, the select signal $sel_p$ for the $p$-th multiplexer is generated from a combination of several $in_h$ signals. Then, Q-module $q_h$ asserts $req_h$. $req_h$ returns to Q-module $q_h$ as $ack_h$ through the corresponding delay element. A register write signal $write_t$ for the $t$-th register is generated from $ack_h$. When the register is shared among different states, $write_t$ is generated from several $ack_h$ signals via an OR gate. As states are sequentially ordered, no two Q-modules control a shared register at the same time, so an OR gate is sufficient to control a shared register. Data are written into a register at a falling edge of $ack_h$. After $ack_h$ is deasserted, Q-module $q_h$ passes the control to the next Q-module $q_{h+1}$ with a rising edge of output signal $out_h$.
The Proposed Behavioral Synthesis System
Synthesis Flow
The synthesis flow of the proposed system is shown in Fig. 3. The inputs of the proposed system are a behavioral description of an application, a resource library, and a set of design constraints.
The front-end analyzes a behavioral description of an application and generates the CDFG of the application. After the CDFG is generated, the bit-width of operations and variables is analyzed. After bit-width analysis, the initial allocation is carried out. In the initial allocation, a functional unit is allocated for each operation to determine the execution time used in operation scheduling. Next, operation scheduling is applied to determine the start time of operations. After operation scheduling, a functional unit or register is allocated for each operation and variable. If shared resources exist, multiplexers are allocated. A data-path circuit is synthesized after resource allocation and binding. Before control synthesis, the state space is decided from the operation scheduling result. In control synthesis, mapping of Q-modules and generation of delay elements and glue logics are carried out to synthesize a control circuit. Finally, the proposed system generates a synthesizable RTL model and a simulation model in Verilog HDL.
The behavioral synthesis method in the proposed system is based on a behavioral synthesis method for synchronous circuits as shown in Ref. 2). For bundled-data implementation, we extend operation scheduling and control synthesis. The following sub-sections describe the details of the behavioral synthesis method in the proposed system. It is based on our former work 11). This paper describes how behavioral descriptions that contain not only data operations but also control constructs such as branches and loops are synthesized.
Inputs and Output of the Proposed System
The inputs and output of the proposed system are listed below.
• Inputs
  - A behavioral description of an application
  - A resource library
  - A set of design constraints
• Output
  - An RTL model and a simulation model of bundled-data implementation

Table 1 The C language syntax supported in the proposed system.
Integer type constants and variables
Assignments, if, switch, for, while, do-while

Table 2 Parameters in a resource library.
Parameters in a resource library
Area
The maximum execution delay
Executable operations
The bit-width of inputs and output

The behavioral description of an application must be written in the C language syntax shown in Table 1. Otherwise, the proposed system terminates the synthesis process with an error. Input signals and output signals can be explicitly specified in the behavioral description using "pragma". Each resource in a resource library is parameterized with the parameters shown in Table 2. Resource parameters must be specified in an XML format and resources must be prepared as synthesizable RTL models in Verilog HDL. A set of design constraints may have a time constraint or a set of resource constraints used for operation scheduling, and a delay margin used to generate delay elements. A set of design constraints must be specified in an XML format.
Note that the proposed system is not restricted to the C language. If a proper front-end is provided, the proposed system can synthesize bundled-data implementation from other languages as well.
The proposed system generates two circuit models. One is a synthesizable RTL model for implementation and the other is a simulation model for functional verification. In the simulation model, arbitrarily short delays are inserted for all feedback loops. This is because logic simulators cannot generate correct values if there is no time difference between input signals and feedback loops. In addition, the delays of delay elements are explicitly represented by exact times, whereas delay elements in the synthesizable RTL model are represented by logic gates. Designers can synthesize and simulate these models using a conventional logic synthesis tool or an HDL simulator.
Front-end
The front-end analyzes and optimizes a given behavioral description using COmpiler INfraStructure (COINS) 13). COINS supports compiler optimizations such as common sub-expression elimination and generates an intermediate format called High-level Intermediate Representation (HIR), which looks like a syntax tree. The front-end generates a CDFG from the generated HIR.
Bit-width Analysis
The bit-width of operations and variables is analyzed using the method in Ref. 11). In bundled-data implementation, bit-width analysis is one of the important processes for optimization. Delay elements in bundled-data implementation are generated so that the delays of delay elements are larger than the maximum delays of the used resources. Therefore, if we can use resources with short delays, the performance of bundled-data implementation can be improved. In general, the narrower the bit-width of a resource, the shorter the delay of the resource.
Initial Allocation
A functional unit is allocated for each operation node $o_i$ to decide the maximum execution delay used in operation scheduling. The maximum execution delay of operation $o_i$ is represented by $d(o_i)$.
In time constraint scheduling, the main objective is to minimize the number of resources used in a data-path circuit. To maximize resource sharing, for the same type of operations, the proposed system allocates the resource which can execute the operation with the maximum bit-width. On the other hand, in resource constraint scheduling, the main objective is to minimize the latency of a data-path circuit. Therefore, for each operation, the resource whose bit-width matches the operation is allocated. If there is no suitable resource in a given resource library, the proposed system asks designers to prepare such a resource before synthesis.
As the use of only functional unit delays may violate a given time constraint after synthesis due to the allocation of registers and multiplexers, the proposed system adds one register delay and two multiplexer delays to each $d(o_i)$. The two multiplexer delays correspond to a multiplexer used for a functional unit and a multiplexer used for a register. A multiplexer delay is estimated from the average number of functional unit sharing or register sharing in the As Late As Possible (ALAP) schedule 2), where operations $o_i$ are scheduled to the latest start time $alaps_i$ under a given time constraint.
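The padding of $d(o_i)$ described above amounts to a simple sum. The sketch below assumes that the register delay and the two estimated multiplexer delays are already known; the function name `padded_delay` and its parameter names are hypothetical, and the estimation of the multiplexer delays from the ALAP schedule is not modeled.

```python
def padded_delay(fu_delay, reg_delay, mux_fu_delay, mux_reg_delay):
    """Delay d(o_i) used in scheduling: functional unit delay plus one
    register delay and two multiplexer delays (one in front of the
    functional unit, one in front of the register)."""
    return fu_delay + reg_delay + mux_fu_delay + mux_reg_delay


# Example with illustrative numbers only
d_add = padded_delay(fu_delay=4.0, reg_delay=0.5, mux_fu_delay=0.75, mux_reg_delay=0.25)
```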
Operation Scheduling
Operation scheduling determines the start time of each operation. The proposed system supports time constraint scheduling and resource constraint scheduling. The objective of time constraint scheduling is to decide start times minimizing the number of resources, while the objective of resource constraint scheduling is to decide start times minimizing the latency. The proposed system uses the Asynchronous Force-directed Scheduling (AFDS) algorithm 15) as a time constraint scheduling algorithm and the Asynchronous Force-directed List Scheduling (AFDLS) algorithm as a resource constraint scheduling algorithm. Both algorithms are based on the FDS and FDLS algorithms 14) developed for synchronous circuits.
In the FDS and FDLS algorithms, operations are assigned to one of the control steps, which have a uniform time interval. These control steps represent clock cycles. On the other hand, in the AFDS and AFDLS algorithms, control steps are determined from sets of approximated start time candidates of operations. In such a case, the time intervals among control steps are not uniform. Control steps are decided in this way because operations in bundled-data implementation are executed immediately after the completion of a previous operation. Note that, except for the decision of control steps, there is no big difference between the FDS (FDLS) and the AFDS (AFDLS) algorithms. Therefore, other scheduling algorithms which use control steps can also be extended in a similar way.
This sub-section describes how control steps are determined from sets of approximated start time candidates of operations, the AFDS algorithm, and the AFDLS algorithm. Figure 4 shows the function ApproximateStep, which determines the control steps.

Approximation of Start Times

[Fig. 4: the function ApproximateStep. Only the last lines of the listing are recoverable:]
  Cand = Cand ∪ Cand_i
  end for
  Step = ComputeStep(G, Cand);
  return Step;

Before the approximation of start time candidates, the As Soon As Possible (ASAP) and ALAP schedules are calculated. The ASAP schedule determines the earliest start time $asaps_i$ for each operation $o_i$ 2). The completion times in the ASAP and ALAP schedules, denoted as $asapc_i$ and $alapc_i$, are the sums of $d(o_i)$ and $asaps_i$ or $alaps_i$, respectively.
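The ASAP and ALAP start and completion times can be sketched as follows. This is a minimal illustration over operation nodes only, assuming that direct predecessor relations between operations are given explicitly and that operations are listed in topological order; the function name `asap_alap` and the dictionary-based interface are hypothetical.

```python
def asap_alap(delays, preds, tc):
    """delays: op -> d(o_i), listed in topological order.
    preds: op -> list of direct predecessor operations.
    tc: time constraint (latest allowed completion time).
    Returns (asaps, alaps, asapc, alapc) as dictionaries."""
    order = list(delays)
    succs = {o: [] for o in order}
    for o in order:
        for p in preds.get(o, []):
            succs[p].append(o)

    # ASAP: start as soon as every direct predecessor has completed.
    asaps = {}
    for o in order:
        asaps[o] = max((asaps[p] + delays[p] for p in preds.get(o, [])), default=0.0)

    # ALAP: complete no later than the earliest ALAP start of a successor.
    alaps = {}
    for o in reversed(order):
        latest_end = min((alaps[s] for s in succs[o]), default=tc)
        alaps[o] = latest_end - delays[o]

    asapc = {o: asaps[o] + delays[o] for o in order}
    alapc = {o: alaps[o] + delays[o] for o in order}
    return asaps, alaps, asapc, alapc


# o3 depends on o1 and o2; time constraint 6.0
delays = {"o1": 2.0, "o2": 3.0, "o3": 1.0}
preds = {"o3": ["o1", "o2"]}
asaps, alaps, asapc, alapc = asap_alap(delays, preds, tc=6.0)
```

Here `o3` can start no earlier than 3.0 and no later than 5.0, while `o1` can start anywhere between 0.0 and 3.0.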
From $asaps_i$ and $alaps_i$ for each operation $o_i$, the time frame $Frame_i$ where operation $o_i$ can be scheduled without violating a given time constraint is calculated. The time frame $Frame_i$ is defined as follows.

$$Frame_i = alaps_i - asaps_i$$

After time frames are calculated, a set of start time candidates for each operation $o_i$ is approximated from the completion times of direct predecessor operations, concurrent operations, and mutually exclusive operations. This paper denotes a set of start time candidates, a set of direct predecessor operations, a set of concurrent operations, and a set of mutually exclusive operations for operation $o_i$ as $Cand_i$, $Direct_i$, $Conc_i$, and $Mutex_i$, respectively. $Direct_i$ is defined as follows.
For $Mutex_i$ and $Conc_i$, a set of transitive predecessors and successors for operation $o_i$ is calculated. We write a transitive relation $i \rightarrow j$ if there is a path of edges $(e_{i,x}, \ldots, e_{y,j})$ from node $n_i$ to node $n_j$. The set of transitive predecessors and successors for operation $o_i$ is denoted as $T_i$ and calculated as follows.
$Mutex_i$ consists of every operation $o_j$ that is neither a transitive successor nor a transitive predecessor of operation $o_i$ and that belongs to a different basic block from the basic block of operation $o_i$.
$Conc_i$ consists of every operation $o_j$ that is neither a transitive successor, a transitive predecessor, nor a mutually exclusive operation for operation $o_i$ and whose $asapc_j$ is a value between $asaps_i$ and $alaps_i$.
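One possible set-builder reading of the three sets described in prose above is given below. It is a reconstruction from the surrounding text, not the original equations: the notation $bb(o_j)$ for the basic block containing $o_j$ is hypothetical, and whether the bounds on $asapc_j$ are inclusive is an assumption.

```latex
\begin{align*}
T_i     &= \{\, o_j \mid j \rightarrow i \ \lor\ i \rightarrow j \,\} \\
Mutex_i &= \{\, o_j \mid o_j \notin T_i \ \land\ bb(o_j) \neq bb(o_i) \,\} \\
Conc_i  &= \{\, o_j \mid o_j \notin T_i \ \land\ o_j \notin Mutex_i
            \ \land\ asaps_i \le asapc_j \le alaps_i \,\}
\end{align*}
```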
Finally, $Cand_i$ for operation $o_i$ is calculated from $Direct_i$ and $Conc_i$ as follows.
Note that we assume that $st$ is a positive real number. This paper calls $o_j$ a previous operation. To obtain more start time candidates, $Cand_i$ for operation $o_i$ is calculated recursively from the execution sequences of previous operations.
After the approximation of start time candidates, the union set $Cand$ ($Cand = \{st \mid st \in Cand_i\}$) of all $Cand_i$ is calculated. Sorting the start time candidates in $Cand$ in ascending order, the proposed system decides $Step$ by translating each start time candidate in $Cand$ to a positive integer which corresponds to $step_w$. Similarly, $Cand_i$ for each operation $o_i$ is translated into a set of schedulable steps $Step_i$ ($Step_i \subseteq Step$).
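The translation from start time candidates to control steps can be sketched as follows. The sketch assumes that $Cand_i$ is already available for each operation; the function name `compute_step` and the returned dictionaries are hypothetical and do not reproduce the function ComputeStep of the proposed system.

```python
def compute_step(cand_by_op):
    """cand_by_op: op -> set of start time candidates Cand_i.
    Returns (step_of, step_i) where step_of maps each candidate time in the
    union Cand to a positive integer step index, and step_i maps each
    operation to its sorted list of schedulable steps Step_i."""
    cand = sorted(set().union(*cand_by_op.values()))            # union, ascending
    step_of = {st: w for w, st in enumerate(cand, start=1)}     # time -> step index
    step_i = {op: sorted(step_of[st] for st in c) for op, c in cand_by_op.items()}
    return step_of, step_i


# Illustrative candidates; note that the step intervals are not uniform.
cand_by_op = {"o1": {1.0}, "o2": {1.0, 2.5}, "o3": {2.5, 4.0}}
step_of, step_i = compute_step(cand_by_op)
```

In this example the three distinct candidate times 1.0, 2.5, and 4.0 become steps 1, 2, and 3, so the interval between steps 1 and 2 differs from the interval a fixed clock period would impose.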
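The AFDS algorithm

[Fig. 5: the AFDS algorithm. Only fragments of the listing are recoverable: Step = ApproximateStep(G, tc); ... Step, Prob); ... Step, Prob, DG);]

Figure 5 shows the AFDS algorithm. The inputs of the AFDS algorithm are a CDFG $G$ and a time constraint $tc$. The output is a set of start times for operations denoted as $Start$.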
The following processes are repeated until all operations are scheduled. First, the AFDS algorithm calls the function ApproximateStep to decide control steps. Next, for each $step_w \in Step_i$, the probability $Prob(i, w)$ that an operation $o_i$ is scheduled to $step_w$ is calculated. Then, for each operation type (e.g., addition), Distribution Graphs (DGs), which represent the resource utilization of each resource, are calculated from $Prob(i, w)$. DGs are calculated for each basic block independently and then merged to generate the entire DGs for the CDFG. After the calculation of DGs, the cost function called self force $Sf(i, w)$ is calculated for each $step_w$. The self force $Sf(i, w)$ represents how well resource utilization is balanced across control steps when an operation $o_i$ is scheduled to a control step $step_w \in Step_i$. Finally, the operation $o_i$ which has the minimum $Sf(i, w)$ is scheduled to the control step $step_w$. This paper represents the start time of operation $o_i$ as $start_i$.
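The text does not give the formulas for $Prob$, the DGs, or $Sf$. The sketch below therefore uses the classic force-directed scheduling formulation for synchronous circuits as an assumption: a uniform probability over the schedulable steps, a DG that sums probabilities per step for one operation type, and a self force that weights the change in probability by the DG. The function names are hypothetical, and details specific to AFDS (non-uniform step intervals, operations spanning several steps, merging DGs across basic blocks) are not modeled.

```python
def probabilities(step_i):
    """Uniform Prob(i, w) = 1 / |Step_i| for every schedulable step w of o_i."""
    return {(op, w): 1.0 / len(steps) for op, steps in step_i.items() for w in steps}


def distribution_graph(step_i, prob):
    """DG(w): sum of Prob(i, w) over all operations of one operation type."""
    dg = {}
    for op, steps in step_i.items():
        for w in steps:
            dg[w] = dg.get(w, 0.0) + prob[(op, w)]
    return dg


def self_force(op, w, step_i, prob, dg):
    """Sf(i, w): DG-weighted change in probability when o_i is fixed to step w."""
    return sum(dg[v] * ((1.0 if v == w else 0.0) - prob[(op, v)]) for v in step_i[op])


# Two additions: a1 can only use step 1, a2 can use step 1 or step 2.
step_i = {"a1": [1], "a2": [1, 2]}
prob = probabilities(step_i)
dg = distribution_graph(step_i, prob)
forces = {w: self_force("a2", w, step_i, prob, dg) for w in step_i["a2"]}
best_step = min(forces, key=forces.get)
```

Under these assumptions, scheduling `a2` to step 2 has the smaller self force, because step 1 is already occupied by `a1` and one adder can then be shared.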
The AFDLS algorithm

Figure 6 shows the AFDLS algorithm. The inputs of the AFDLS algorithm are a CDFG $G$, a resource library $R$, and a set of resource constraints $RC$.

Fig. 6 The AFDLS algorithm.

Initially, currentstep, which represents the currently referenced control step, is set to 0. The following processes are repeated until all operations are scheduled. For each resource $r \in R$ whose resource constraint $rc_r$ is more than 0, a set of operations that can be scheduled to currentstep using resource $r$ and a set of executed operations that are already scheduled to a previous step using resource $r$ but not finished at currentstep are calculated. These sets are denoted as $Op$ and $ExecutedOp$. If the number of $Op$ plus the number of $ExecutedOp$ is less than resource constraint $rc_r$, all operations in $Op$ are scheduled to currentstep. Otherwise, operations less than $rc_r - |ExecutedOp|$ are scheduled based on self forces. For the calculation of self forces, the same processes as in the AFDS algorithm are carried out. The only difference is that, although self forces in the AFDS algorithm are calculated for each schedulable step of all operations, self forces in the AFDLS algorithm are calculated only for currentstep of operations in $Op$.
After the scheduling at currentstep, the next control step is decided from the earliest completion time of scheduled operations.
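The choice of the next control step can be sketched as below. The sketch assumes that the next step is the earliest completion time strictly later than the current time among the operations scheduled so far; the function name `next_step_time` and the `(start, delay)` representation are hypothetical.

```python
def next_step_time(current, scheduled):
    """scheduled: op -> (start time, delay d(o_i)).
    Returns the earliest completion time after `current`, or None when
    every scheduled operation has already completed."""
    pending = [start + delay for start, delay in scheduled.values() if start + delay > current]
    return min(pending) if pending else None


# o1 and o2 both start at time 0.0 with different delays.
scheduled = {"o1": (0.0, 2.0), "o2": (0.0, 3.5)}
first = next_step_time(0.0, scheduled)
second = next_step_time(first, scheduled)
```

With a uniform clock of period 3.5, the operation waiting on `o1` would start at 3.5; here the next control step is placed at 2.0, when `o1` actually completes.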
Resource Allocation and Binding
The proposed system allocates a functional unit for each operation and then a register for each variable. For shared functional units and registers, the proposed system allocates multiplexers.
The proposed system uses an extension of the Left-Edge (LE) algorithm 2) called the Extended Left-Edge (ELE) algorithm 11). The difference between the LE algorithm and the ELE algorithm is that the ELE algorithm uses a priority for resource allocation calculated from the bit-width, inputs, and output of operations. The objective of the ELE algorithm is to minimize the bit-width of allocated resources and the number of allocated multiplexers. The ELE algorithm initially sorts operations in $Op$ in ascending order of start times. $Unalloc$, which represents the set of unallocated operations, is initialized to $Op$, and $Alloc$ is initialized to 0. $Alloc_i$ represents the index of the allocated functional unit for operation $o_i \in Op$.
Then, the following processes are carried out until $Unalloc$ becomes the empty set. For each control step $step_w$, a subset of operations whose lifetime intersects $step_w$ is calculated, that is, the operations that are executed at $step_w$. This subset is represented as $SubO \subseteq Op$. In addition, a set of available functional units at $step_w$ is calculated. This set is represented as $Avail \subseteq Fu$. Functional units $fu_c \in Avail$ are allocated to operations $o_i \in SubO$ in the function AllocateResource.

[Fig. 7 The ELE algorithm for functional unit allocation. Only the last lines of the listing are recoverable:]
  AllocateResource(Op, SubO, Avail, Unalloc, Alloc);
  end for
  end while
  return Alloc;
In the function AllocateResource, $Io$ and $Bit$ are calculated for each pair of $o_i \in SubO$ and $fu_c \in Avail$. $Io(i, c)$ represents the number of the same inputs and output among operation $o_i$ and operations $o_j \in Op$ when the same functional unit $fu_c$ as $o_j$ is allocated to $o_i$.
represents the number of the same inputs and output between operations o i and o j . The same inputs mean that the sources of operations are the same while the same output means that the destination of operations is the same. The outputs of operations become the same when mutually exclusive assignments for the same variable exist in branches (e.g., o 6 and o 11 in Fig. 1) . A higher value in Io(i, c) implies that the number of inputs for the multiplexers used for the functional unit fu c and a register to store the operation result is reduced more.
$Bit(i, c)$ represents the difference of the bit-width among operation $o_i$ and operations $o_j \in Op$ when the same functional unit $fu_c$ as $o_j$ is allocated for $o_i$. Here, $b_i$ and $b_j$ represent the bit-width of operations $o_i$ and $o_j$, respectively.
If $b_j$ minus $b_i$ is greater than or equal to 0, the difference between $b_j$ and $b_i$ is accumulated to $Bit(i, c)$. A smaller value in $Bit(i, c)$ means that the difference of the bit-width among operation $o_i$ and operations $o_j$ is large. It implies that many bits in the functional unit $fu_c$ are not utilized by operation $o_i$. Such an allocation is not suitable from the viewpoint of resource sharing.
After the calculation of $Io$ and $Bit$, a functional unit is allocated for each operation in $SubO$ until $SubO$ or $Avail$ becomes the empty set. Resource allocation starts from the combination of operation $o_i \in SubO$ and functional unit $fu_c \in Avail$ for which $Io$ has the maximum value. If more than one combination has this value, the combination whose value of $Bit$ is 0 or the closest to 0 is selected.
Figures 9 and 10 show the ELE algorithm and the function AllocateResource for register allocation. Instead of $Op$ and $Fu$ in the ELE algorithm for functional unit allocation, the variable set $V$ and the register set $Reg$ are given as arguments. The procedure is mostly the same as functional unit allocation. For each control step $step_w$, a subset of variables ($SubV$) whose lifetime intersects $step_w$ and a set of available registers ($Avail$) at $step_w$ are calculated. Then, registers $reg_c$ are allocated based on the priority calculated from the number of the same inputs and output and the difference of the bit-width among variables.
After functional unit and register allocation, multiplexers are allocated for shared functional units and registers.
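For reference, the baseline that the ELE algorithm extends can be sketched as a left-edge style first-fit over lifetimes sorted by start time. This is the classic technique, not the ELE algorithm itself: the priorities $Io$ and $Bit$ are not modeled, lifetimes are assumed to be half-open intervals, and the function name `left_edge` is hypothetical. Indices start at 1 because the text uses 0 for an unallocated entry.

```python
def left_edge(lifetimes):
    """lifetimes: name -> (start, end), with end exclusive.
    Returns name -> resource index (1-based), reusing a resource as soon as
    the lifetime previously bound to it has ended."""
    alloc = {}
    free_at = []  # free_at[c] is the time at which resource c + 1 becomes free
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for c, t in enumerate(free_at):
            if t <= start:               # resource c + 1 is free: share it
                free_at[c] = end
                alloc[name] = c + 1
                break
        else:                            # no free resource: allocate a new one
            free_at.append(end)
            alloc[name] = len(free_at)
    return alloc


# Four variables with overlapping lifetimes need two registers.
lifetimes = {"v1": (0, 2), "v2": (1, 3), "v3": (2, 4), "v4": (3, 5)}
alloc = left_edge(lifetimes)
```

The ELE algorithm replaces the first-fit choice in the inner loop with the priority based on shared inputs and outputs and on the bit-width difference, as described above.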
[Fig. 9 The ELE algorithm for register allocation. Only the last lines of the listing are recoverable:]
  Avail = GetAvail(Reg, LT stepw);
  AllocateResource(V, SubV, Avail, Unalloc, Alloc);
  end for
  end while
  return Alloc;

[Fig. 10: the function AllocateResource for register allocation. Only the last lines of the listing are recoverable:]
  Alloc_i = index(reg_c);
  Unalloc = Unalloc \ {v_i};
  end while

Finally, resources in a given resource library are bound to the allocated functional units, registers, and multiplexers. During resource binding, resources which have a sufficient bit-width are bound.
Control Synthesis
Before control synthesis, the proposed system calculates the state space of a synthesized circuit from the scheduling result. Then, a control circuit is synthesized through mapping of Q-modules and the generation of glue logics and delay elements.
State Allocation
The proposed system extends the state allocation method proposed by Tseng 16) so that states are determined by the start times of operations. The proposed system supports the following slicing methods.
• The local slicing
• The global slicing simple
• The global slicing complex

In the local slicing, states are determined from the set $Start$. An interval between start times becomes a state $s_h$. Different states are allocated for mutually exclusive basic blocks. In the global slicing simple, several states in different basic blocks are merged if the interval for the states is equivalent. The global slicing complex is an extension of the global slicing simple in that not only start times but also completion times are used for state allocation.
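The local slicing can be sketched for a single basic block as follows. The sketch assumes that the last state ends at a given end time (for instance, the latest completion time); the function name `local_slicing` and the interval representation are hypothetical, and the handling of mutually exclusive basic blocks and the two global slicing variants is not modeled.

```python
def local_slicing(start_times, end_time):
    """start_times: start times of the operations in one basic block.
    end_time: assumed end of the last state.
    Returns a list of (begin, end) intervals, one per state s_h."""
    points = sorted(set(start_times))        # distinct start times, ascending
    bounds = points + [end_time]
    return [(bounds[h], bounds[h + 1]) for h in range(len(points))]


# Two operations start at 0.0, one at 2.0, and one at 3.5.
states = local_slicing([0.0, 0.0, 2.0, 3.5], end_time=5.0)
```

Each interval length is the execution time of the corresponding state, which is what the delay element on the request signal of that state has to cover.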
Mapping of Q-modules and Generation of Glue Logics
The proposed system maps a Q-module $q_h$ to each state $s_h$. Then, glue logics are generated. A multiplexer select signal $sel_p$ for the $p$-th multiplexer is generated from a glue logic which comes from input signals $in_h$ of Q-modules $q_h$. A register write signal $write_t$ for the $t$-th register is generated from a glue logic which comes from acknowledge signals $ack_h$ of Q-modules $q_h$.
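As an illustration of the glue logic for register write signals, the sketch below emits one Verilog continuous assignment per register, OR-ing the acknowledge signals of the states that write it. The signal naming (`write_t`, `ack_h` as flat identifiers) and the function name `write_signals` are hypothetical and do not reproduce the output of the proposed system; select signals are omitted because their encoding is not specified in the text.

```python
def write_signals(reg_states):
    """reg_states: register index t -> iterable of state indices h in which
    the register is written. Returns Verilog assign statements as strings."""
    lines = []
    for t in sorted(reg_states):
        terms = " | ".join(f"ack_{h}" for h in sorted(reg_states[t]))
        lines.append(f"assign write_{t} = {terms};")
    return lines


# Register 1 is written in state 2 only; register 2 is shared by states 1 and 3.
verilog = write_signals({2: [3, 1], 1: [2]})
```

An unshared register reduces to a plain wire from the single acknowledge signal, while a shared register gets the OR gate described above.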
Insertion of Delay Elements
The delay $sd_h$ for state $s_h$ is the maximum path delay, which is calculated from the sum of the delays of the resources used in the state. The proposed system generates delay elements with buffers. As data are written into registers by a falling edge of $ack_h$, every delay element is traversed twice. The first pass is from a rising edge of $req_h$ to a rising edge of $ack_h$ and the second is from a falling edge of $req_h$ to a falling edge of $ack_h$. It implies that the required delay of a delay element is a value larger than $sd_h/2$.
Usually, the delay in state $s_h$ becomes long after the physical design due to wire delays. Moreover, the delay may be changed by technological or environmental variations. Therefore, the proposed system generates delay elements with a margin, $margin$, specified in a given constraint file. The number of buffers in a delay element is decided so that the delay of the delay element is larger than $margin \cdot sd_h / 2$.
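The number of buffers in a delay element can be sketched as the smallest integer count whose total delay strictly exceeds $margin \cdot sd_h / 2$. The sketch assumes identical buffers with a known per-buffer delay; the function name `buffer_count` and the parameter `buf_delay` are hypothetical.

```python
import math


def buffer_count(sd_h, margin, buf_delay):
    """Smallest n such that n * buf_delay > margin * sd_h / 2."""
    required = margin * sd_h / 2.0
    return math.floor(required / buf_delay) + 1


# State delay 10.0, margin 1.5, buffer delay 0.5 (same time unit throughout)
n = buffer_count(sd_h=10.0, margin=1.5, buf_delay=0.5)
```

In this example the required delay is 7.5; fifteen buffers would give exactly 7.5, which is not strictly larger, so sixteen buffers are used.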
Experimental Results
This section shows the effectiveness of the proposed system by comparing the RTL models of bundled-data implementations synthesized by the proposed system with the synchronous counterparts and with the bundled-data implementations synthesized by using a behavioral synthesis method for synchronous circuits. This paper calls the latter bundled-data implementations direct implementations. For the experiments, the proposed system is implemented in Java. The FDS and FDLS algorithms and a finite state machine (FSM) generator are also implemented for the synthesis of synchronous circuits. Note that optimization techniques such as pipelining and chaining are not considered in the experiments. They will be considered in our future work. The experiments are carried out on a Windows machine which has a dual-core processor (2.66 GHz) and 2 GB of memory.
Table 3 shows the statistics of the benchmarks used in the experiments. These benchmarks are downloaded from Refs. 17), 18) and modified to satisfy the supported syntax shown in Table 1. The columns in Table 3 represent the name, the number of operations, the number of basic blocks, the number of branches, and the number of loops in the benchmarks, respectively.
Table 4 shows a part of the resource library used. Each resource is modeled in Verilog HDL and synthesized by using Xilinx ISE WebPACK 9.2i 19) targeting a Virtex4 (xc4vlx15-12sfs623) FPGA. The columns of Table 4 represent the name, bit-width, area, and delay of each resource. The unit of area is the slice. A slice consists of two flip-flops and two 4-to-1 look-up tables (LUTs). As multipliers can be implemented not on slices but on embedded multipliers in Virtex4, the number of slices for multipliers is set to 0. In the proposed system, the AFDS or AFDLS algorithm, the ELE algorithm, and the global slicing simple are used for operation scheduling, resource allocation, and state allocation. No time margin is assigned to generate delay elements.
For the synchronous counterparts, the FDS or FDLS algorithm, the ELE algorithm, and the global slicing simple are used. The control circuits in the synchronous counterparts are generated by using the FSM generator. The direct implementations are synthesized by using the FDS or FDLS algorithm, the ELE algorithm, and the global slicing simple for data-path circuits and Q-modules for control circuits.
The time interval of control steps used with the FDS or FDLS algorithm is decided as follows. First, we find two operations in the initial allocation result: one with the minimum operation delay and the other with the maximum operation delay. We then synthesize each benchmark repeatedly, increasing the time interval in steps of 0.1 from the minimum operation delay to the maximum operation delay. The time interval that yields an optimum RTL model of the synchronous counterpart is selected.
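The sweep over candidate time intervals can be sketched as follows. This is only an illustration: the function names are hypothetical, and the cost callback stands in for a full synthesis run that returns whatever figure of merit defines the optimum RTL model (the text does not state the exact criterion, so a lower-is-better cost is an assumption).

```python
def candidate_intervals(min_delay, max_delay, step=0.1):
    """List the time intervals from min_delay to max_delay in increments of step.

    Indices are used instead of repeated addition so that floating-point
    error does not accumulate across the sweep.
    """
    count = int(round((max_delay - min_delay) / step))
    return [round(min_delay + i * step, 10) for i in range(count + 1)]


def select_interval(min_delay, max_delay, cost, step=0.1):
    """Return the candidate interval with the lowest cost.

    cost is a hypothetical callback: it is assumed to synthesize the
    synchronous counterpart for a given interval and return a number
    where smaller is better (for example, area in slices).
    """
    return min(candidate_intervals(min_delay, max_delay, step), key=cost)


if __name__ == "__main__":
    # Toy cost function used only to exercise the sweep.
    print(select_interval(1.0, 1.5, cost=lambda t: abs(t - 1.3)))
```

Because each candidate requires one synthesis run, the number of runs grows with the gap between the minimum and maximum operation delays.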
The first comparison shows the number of resources and the number of slices for the synthesized RTL models under a time constraint. Table 5 shows the experimental results. The rows in "async" represent the results of the proposed system, the rows in "sync" represent the synchronous counterparts, and the rows in "direct" represent the direct implementations. For each benchmark, behavioral synthesis is carried out for three time constraints. The first constraint corresponds to the critical path delay of each benchmark derived by the ASAP algorithm. The second and third constraints correspond to the critical path delay * 1.5 and the critical path delay * 2.0, respectively. The column "t_step" represents the time interval of control steps when the FDS algorithm is used. The columns "FUs", "Regs", and "Muxs" in "resource usage" represent the numbers of functional units, registers, and multiplexers in the synthesized RTL models, respectively. The column "states" represents the number of states in the synthesized RTL models. The column "area" represents the number of slices obtained when logic synthesis is carried out for the synthesized RTL models using ISE. The columns "S", "RA", "CS", and "others" in "run-time" represent the times for scheduling, resource allocation, control synthesis, and other processing, respectively. The column "total" represents the total behavioral synthesis time.
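As an illustration of how the three time constraints relate to the critical path, the following sketch computes an ASAP critical path delay for a small data-flow graph and scales it by 1.0, 1.5, and 2.0. It assumes a single acyclic graph in which every operation starts as soon as all of its predecessors finish; the function names and the example graph are hypothetical, and the proposed system may derive the critical path differently across basic blocks, branches, and loops.

```python
def asap_critical_path(delays, preds):
    """Return the latest ASAP finish time over all operations.

    delays: dict mapping operation name to its delay
    preds:  dict mapping operation name to a list of predecessor names
            (operations without an entry have no predecessors)
    """
    finish = {}

    def finish_time(op):
        # An operation starts when its slowest predecessor has finished.
        if op not in finish:
            start = max((finish_time(p) for p in preds.get(op, [])), default=0.0)
            finish[op] = start + delays[op]
        return finish[op]

    return max(finish_time(op) for op in delays)


def time_constraints(critical_path, factors=(1.0, 1.5, 2.0)):
    """Scale the critical path delay by the three factors used in the experiments."""
    return [critical_path * f for f in factors]


if __name__ == "__main__":
    delays = {"a": 2.0, "b": 3.0, "c": 4.0, "d": 1.0}
    preds = {"c": ["a", "b"], "d": ["c"]}
    print(time_constraints(asap_critical_path(delays, preds)))
```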
Note that the symbol "-" in the area column means that ISE could not synthesize the logic circuit because of its substantially large state space; in these cases, logic synthesis by ISE froze after all the memory available in our environment was used. In addition, for the benchmarks usqrt and fdct, we verified functional correctness using the generated simulation models and the HDL simulator ModelSim 20).
The second comparison in Table 6 shows the latency of the synthesized RTL models under a set of resource constraints. For each benchmark, we synthesize RTL models twice, arbitrarily changing the number of functional units. The number of functional units is shown in the column "rc". As in the first comparison, the rows in "async" represent the results of the proposed system, the rows in "sync" represent the synchronous counterparts, and the rows in "direct" represent the direct implementations. The column "latency" represents the latency of the synthesized RTL models. The values in "async" and "direct" are the sum of the state delays, while the values for the synchronous counterparts are the product of t_step and the number of states.
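The two latency definitions can be contrasted with a short sketch. The function names and the example numbers are illustrative assumptions, not measured results, and the sketch treats each state as being visited exactly once.

```python
def bundled_data_latency(state_delays):
    """Latency of a bundled-data implementation: the sum of the state delays."""
    return sum(state_delays)


def synchronous_latency(t_step, num_states):
    """Latency of a synchronous counterpart: t_step times the number of states."""
    return t_step * num_states


if __name__ == "__main__":
    # Hypothetical delays of three states with non-uniform durations.
    state_delays = [2.0, 3.5, 1.5]
    # A synchronous circuit with the same states must use a t_step
    # that covers the slowest state.
    t_step = max(state_delays)
    print(bundled_data_latency(state_delays))
    print(synchronous_latency(t_step, len(state_delays)))
```

When state delays differ, the sum of the individual delays is smaller than the number of states multiplied by a uniform step that accommodates the slowest state, which is one source of the latency advantage of non-uniform control steps.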
Discussion
Area. As the main objective of time constraint scheduling is to minimize area, we discuss the impact of the proposed system on area with reference to Table 5.
Compared to the synchronous counterparts, the area of the circuits synthesized by the proposed system is slightly larger. Because each buffer in a delay element is implemented by one slice in FPGAs, a large portion of the area overhead is occupied by delay elements. To reduce the area overhead, one may consider optimizing the delay elements. It can be realized by utilizing the delays
Information and Media Technologies 4(2): 211-226 (2009), reprinted from: IPSJ Transactions on System LSI Design Methodology 2: 64-79 (2009) © Information Processing Society of Japan
Circuits where the number of resources is less than that of the synchronous counterpart are such candidates. For example, in the case of usqrt with a time constraint of 86.0 ns, the number of slices used in the delay elements is 42. If we can remove more than 10 slices from the delay elements, the area of the circuit synthesized by the proposed system becomes smaller than that of the synchronous counterpart. This optimization will be considered in our future work.
Compared to the direct implementations, the experimental results may not show a large difference between the circuits synthesized by the proposed system and the direct implementations. Whether the proposed system synthesizes a better or a worse circuit depends on the number of resources. This is because of the heuristic nature of the proposed system. However, the proposed system has a greater chance of synthesizing the best circuit; mdct is such a case. Because a different data-path circuit is synthesized by using non-uniform control steps, the difference in the number of resources results in less area. In the direct implementations, on the other hand, it is difficult to synthesize a better circuit than the synchronous counterpart. This is because the data-path circuit is the same as that of the synchronous counterpart, while the control circuit additionally contains delay elements.
Latency. As the main objective of resource constraint scheduling is to minimize latency, we discuss the impact of the proposed system on latency with reference to Table 6.
Compared to the synchronous counterparts and the direct implementations, the proposed system synthesizes the best circuits in many cases (e.g., all cases of fdct, decoder, and pred1). With non-uniform control steps, operations are scheduled so that they are executed immediately after the completion of the preceding operations. In addition, the use of non-uniform control steps results in schedules that differ from those of the synchronous counterparts. On the other hand, the latency improvement of the direct implementations over the synchronous counterparts is limited. This is because the same scheduling results are used even though the control schemes are different.
Synthesis time. In behavioral synthesis under time constraints, the proposed system takes more time for scheduling because control steps are updated whenever an operation is scheduled. On the other hand, there is no large difference in behavioral synthesis under resource constraints. This is because the proposed system approximates start time candidates at control steps where the number of available functional units is less than the number of schedulable operations.
From the experimental results, we can say that the proposed system is preferable for behavioral synthesis of bundled-data implementations, in that in many cases it synthesizes better circuits in terms of area and latency than the direct implementations. Moreover, in several cases, the proposed system synthesizes better circuits than the synchronous counterparts not only in latency but also in area. This is the effect of the non-uniform control steps used in the proposed system.
Conclusions
This paper proposes a behavioral synthesis system for asynchronous circuits with bundled-data implementation. The proposed system is implemented in Java and evaluated through experiments. The experimental results show the effectiveness of the proposed system in that the synthesized bundled-data implementations are superior to the synchronous counterparts and the direct implementations in many cases.
In our future work, we are going to extend the proposed system to synthesize behavioral descriptions with arrays and floating-point operations. Moreover, pipelining, chaining, and other optimization techniques will be implemented.
