A method for automating the synthesis of asynchronous control circuits from high-level (CSP-like) and/or partial STG (involving only functionally critical events) specifications is presented. The method solves two key subtasks in this new, more flexible design flow: handshake expansion, i.e. inserting reset events with maximum concurrency, and event reshuffling under interface and concurrency constraints, by means of concurrency reduction. In doing so, the algorithm optimizes the circuit both for size and performance. Experimental results show a significant increase in the solution space explored when compared to existing CSP-based or STG-based synthesis tools.
Introduction
Specifying an asynchronous circuit is a cumbersome and error-prone task because the designer has to define the behavior of every signal at every moment of time. Although the value of a signal might sometimes be irrelevant to the general functioning of the system, one must be specific about its behavior by exactly defining whether the signal is stable at 0 or 1 or making a rising or falling transition.
To circumvent this problem, the designer should be able to specify the behavior of a circuit by only defining those events that are relevant to its function; they are called functional events. The rest of the events (non-functional) can be defined arbitrarily under the requirement of preserving the correctness of the circuit behavior. This is exemplified by the gate-level implementation of a rising edge-triggered flip-flop. Only the rising edge is "functional", and must have a precise relationship with the input and the output signals (setup/hold constraints and output delay respectively). The falling edge can occur almost at any time between two consecutive rising edges. In the asynchronous context, this kind of freedom provides additional room for optimization under different cost functions aimed at area and/or performance.
There are various design scenarios in which this approach may be useful:
1. The designer concentrates on the key functional aspects and, e.g., specifies only the rising edges of signals. A tool automatically inserts non-functional events. Even when all events are functional, there is some freedom in making them either ordered or concurrent. The designer restricts some functionally important concurrency/ordering relations and allows the tool to choose how to reduce concurrency and optimize the circuit.
(return-to-zero signal transitions in four phase expansion of the channels) for optimizing area, performance or power.
In this paper we solve the problem of handshake expansion in a canonical fashion, by inserting "reset" events with maximum concurrency with respect to the other signals. We then solve the problem of reshuffling by only considering the operation of concurrency reduction.
The idea of using concurrency reduction as an efficient method in the optimization loop was first proposed in [5]. The main distinctive features of our approach with respect to that work are:
1. The reduction mechanism is applied in a wider framework (handshake expansion, reshuffling), instead of working at the level of completely specified State Graphs.
2. A reduction based on removal of State Graph arcs is used, instead of coarser techniques based on removal of states.
3. Not every form of concurrency reduction can be modeled by a sequence of pairwise reductions. In [3] and Section 5, a more general (albeit expensive) technique is discussed.
4. The reduction procedures presented in this paper are aimed at the general minimization of logic, instead of only solving the CSC problem.
In the rest of the paper, after Section 2, devoted to theoretical background, and Section 3, devoted to an informal overview, we will answer the following questions: (1) How is concurrency exploited starting from a partial specification of an asynchronous controller? (Section 4); (2) What are the valid reductions of concurrency? (Section 5); (3) How can concurrency be reduced by iterative application of a single, elementary operation? (Section 6); (4) How is the quality of the solution estimated? (Section 7). Section 8 presents experimental results.
reshuffling: selecting the order of some non-functional events
Theoretical background
This section assumes the reader to be familiar with Petri nets [7]. Figure 1.a shows a timing diagram of a simple controller between an asynchronous memory and a processor. An operational cycle is triggered by the processor requesting data (Req goes high). After this request, memory prepares data and the controller replies with an acknowledgment (Ack goes high). From now on the processor can reset the request and immediately start a new cycle. Note The set of all signals is partitioned into a set of inputs, which come from the environment, and a set of outputs and state signals that must be implemented.
Implementability conditions. In addition to consistency, the following two properties are required for an SG to be implementable into a hazard-free asynchronous circuit.
The first property is speed independence, with three constituents: determinism, commutativity, and output persistency. It can be easily shown that for a speed-independent SG two output events a and b are concurrent iff their ERs intersect: ER(a) ∩ ER(b) ≠ ∅.
In the SG of Figure 1.d transition Req+ is enabled in states 1*0* and 00* (ER(Req+) = {1*0*, 00*}) while Ack- is enabled in 1*0* and 1*1 (ER(Ack-) = {1*0*, 1*1}). Excitation regions of these transitions intersect, thus implying that the corresponding transitions are concurrent.
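The excitation-region test can be illustrated with a small Python sketch. The state graph below is an assumption: it is reconstructed from the informal description of the memory controller and from the two excitation regions quoted in the text, not copied from Figure 1.d. State labels are read as (Ack, Req), with a star marking an excited signal, which is also an inference from the quoted regions.

```python
from collections import defaultdict

# Hypothetical state graph as (source state, event, target state) arcs.
ARCS = [
    ("00*", "Req+", "0*1"),
    ("0*1", "Ack+", "11*"),
    ("11*", "Req-", "1*0*"),
    ("1*0*", "Ack-", "00*"),
    ("1*0*", "Req+", "1*1"),
    ("1*1", "Ack-", "0*1"),
]

def excitation_regions(arcs):
    """Map each event to the set of states in which it is enabled."""
    regions = defaultdict(set)
    for source, event, _target in arcs:
        regions[event].add(source)
    return dict(regions)

def concurrent(regions, a, b):
    """Two events are reported concurrent when their ERs intersect."""
    return bool(regions[a] & regions[b])

regions = excitation_regions(ARCS)
print(sorted(regions["Req+"]), sorted(regions["Ack-"]))
print(concurrent(regions, "Req+", "Ack-"))
```

In this sketch the only state shared by both regions is 1*0*, which is where Req+ and Ack- can fire in either order.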
Overview of the method
We illustrate our methodology by means of an example. Figure 2.a shows the structure of an LR-process [6] using the "handshake component" notation [1]. The process has a passive port l and an active port r. It transfers control from the left port to the right port. Figure 2.e shows an STG with maximal concurrency for all falling transitions, assuming that all signals are independent, and that no interface constraints were given. This handshake expansion however is not valid for the LR-process. Indeed, we should obey additional ordering constraints for the channels: never reset the requesting signal before receiving the acknowledgment. For example, for a passive port l one should satisfy the following interleaving of signal transitions:
*[li+; lo+; li-; lo-]
Similarly for the active channel. Figure 2.f presents a valid handshake expansion with maximal concurrency for the LR-process taking interface constraints into account. Table 1 presents the area and performance results for different implementations of the LR-process. The row "Max. concurrency" corresponds to the implementation of the STG with maximum concurrency of the reset signal transition. The circuit area is 168 units. Assuming that all internal and output events have a delay of 1 time unit, and that all input events have a delay of 2 time units, the critical cycle is 13 units and contains 3 input events. Other implementations are shown in Figure 3.
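The channel ordering constraint for the LR-process (never reset a request before its acknowledgment arrives) can be checked on a trace of signal transitions. The following Python sketch is only illustrative. The signal names li, lo, ri and ro are assumptions following the usual convention of one input and one output wire per port, and the function name is hypothetical.

```python
def obeys_four_phase(trace, req, ack):
    """Check that the req/ack events in trace follow *[req+; ack+; req-; ack-]."""
    cycle = [req + "+", ack + "+", req + "-", ack + "-"]
    position = 0
    for event in trace:
        if event[:-1] not in (req, ack):
            continue  # events on other signals may interleave freely
        if event != cycle[position]:
            return False
        position = (position + 1) % 4
    return True

# Passive port l: the request is the input li, the acknowledgment is lo.
# Active port r: the request is the output ro, the acknowledgment is ri.
trace = ["li+", "ro+", "ri+", "lo+", "li-", "ro-", "ri-", "lo-"]
print(obeys_four_phase(trace, "li", "lo"), obeys_four_phase(trace, "ro", "ri"))
```

A trace such as ro+, ro-, ri+, ri- is rejected for the active port because the request is withdrawn before the acknowledgment.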
Handshake expansion
This section explains how handshake expansion is performed. The syntax of our specifications allows one to describe the behavior of channels and partially specified signals. In both cases, the specification only contains the active transitions, whereas the handshake expansion method transforms the specification according to the refinement chosen by the designer: 2-phase refinement, with no distinction between up and down transitions, or 4-phase refinement, with return-to-zero signaling for each handshake.
Partially specified signals. The STG transformation required to expand a partially specified signal is shown in Figure 5.a and b. Figure 5.a illustrates an additional return-to-zero transition that must be connected (using the places labelled rdy and rtz) to the functional part corresponding to the rising transition of the signal, shown in Figure 5.a. Note that each rising transition is enabled only when the return-to-zero transition has fired (arc rdy → b+). The return-to-zero transition is enabled as soon as the rising transition has fired (arc b+ → rtz).
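The rdy/rtz structure can be played as a token game. The Python sketch below is a minimal illustration under stated assumptions: the places go and done stand for the surrounding functional part and are hypothetical names, and the initial marking (rdy marked, signal b at 0) is assumed rather than taken from Figure 5.

```python
# Each transition maps to (input places, output places).
NET = {
    "b+": ({"rdy", "go"}, {"rtz", "done"}),  # functional rising transition
    "b-": ({"rtz"}, {"rdy"}),                # inserted return-to-zero transition
}

def enabled(marking, transition):
    """A transition is enabled when every input place holds a token."""
    inputs, _outputs = NET[transition]
    return all(marking.get(place, 0) > 0 for place in inputs)

def fire(marking, transition):
    """Return the marking reached by firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError(transition + " is not enabled")
    inputs, outputs = NET[transition]
    result = dict(marking)
    for place in inputs:
        result[place] -= 1
    for place in outputs:
        result[place] = result.get(place, 0) + 1
    return result

marking = {"rdy": 1, "go": 1}
marking = fire(marking, "b+")   # b rises; rtz receives a token
marking = fire(marking, "b-")   # b returns to zero; rdy is marked again
print(marking)
```

After b+ fires, b- is enabled immediately, while a second b+ has to wait for both the functional part and the token returned to rdy.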
Channels. For channel refinement we use a notation similar to that proposed for handshake processes [1]. Two types of events can occur in channel a: input events (a?) and output events (a!). The terminals of a channel are called ports. A channel a is implemented by two signals: ai (input) and ao (output).
The expansion from channel to signal events can be done by manipulating the structure of the underlying Petri net. The expansion to a 2-phase protocol relabels transitions from a? to ai~ and a! to ao~, where the suffix ~ denotes a transition toggling the value of the signal. The expansion to a 4-phase protocol is performed by relabeling transitions and inserting return-to-zero events. The transformations performed at the STG level consist of adding a return-to-zero structure and defining multiple instances of the transitions representing channel events. The return-to-zero structure corresponding to a channel is depicted in Figure 5.c. The place req indicates that the channel is ready for a new handshake. The place ack indicates that the channel has received a request (a? for passive and a! for active handshakes) and will perform an acknowledgment (a! for passive and a? for active handshakes). The places p-rtz (for passive) and a-rtz (for active) receive a token as soon as the handshake is complete and activate the return-to-zero transitions. This scheme allows a channel to act both as an active and as a passive port at different instants of the behavior of the system. gives an overall picture of the channel behavior in the set and reset phases. Note that the specification must properly interleave the events on the channel according to the handshake protocol, otherwise the expansion would produce an inconsistently encoded STG. This scheme guarantees the maximum concurrency for the return-to-zero sequence, that is then exploited by the concurrency reduction algorithm described in Section 3.

Example. Figure 6 presents an example illustrating all the above transformations. The original specification (Figure 6.a) has a channel (a), a partially specified signal (b) and a completely specified signal (c). Two-phase and four-phase refinements of the same specification are shown in Figures 6.b and 6.c.
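Under 2-phase refinement each channel event becomes a single toggle of the corresponding wire. The Python sketch below illustrates this relabeling on a sequence of channel events and resolves each toggle into a rising or falling transition. It assumes that all wires start at 0 and that channel a uses the wires ai and ao; the function name is hypothetical.

```python
def expand_two_phase(events):
    """Relabel channel events (a?, a!) as toggles of wires ai and ao.

    Every wire is assumed to start at 0, so the first toggle is a rise.
    """
    value = {}
    transitions = []
    for event in events:
        channel, kind = event[:-1], event[-1]
        wire = channel + ("i" if kind == "?" else "o")
        value[wire] = 1 - value.get(wire, 0)  # toggle the wire
        transitions.append(wire + ("+" if value[wire] else "-"))
    return transitions

print(expand_two_phase(["a?", "a!", "a?", "a!"]))
```

Two consecutive handshakes on a passive channel thus use one rising and one falling edge per wire, with no return-to-zero phase in between.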

Concurrency reduction
In this section we develop the theory and algorithms that allow us to explore only valid reductions of concurrency more efficiently than by working on a state-by-state basis. In particular, our notion of concurrency reduction is related to the introduction of places (causal constraints) at the STG level, and then "fixing" the STG so that consistency and speed-independence are preserved.
Valid concurrency reduction should preserve certain properties.
Let A be the initial SG and Ared be a reduced SG. Reducing concurrency for event e means truncating some ERs of this event. In other words, some of the arcs labeled with e are removed from the SG as a result of concurrency reduction. This may cause some of the states to become unreachable and to be removed from the SG.
No states or arcs not present in the initial SG can appear in Ared. This trivially implies that consistency, commutativity, and determinism of the SG cannot be violated as a result of concurrency reduction. Also no new CSC conflicts can appear (in fact some or all of the conflicts can disappear due to state removal).
Validity then requires the following properties to be satisfied after concurrency reduction: Speed-independence is preserved: as noted above, commutativity and determinism are automatically preserved, so the only constraint is that if A is output persistent, then Ared must be output persistent. Whenever concurrency is reduced for an output signal, one must also make sure that this is reflected in the specification of the behavior assumed by the environment (e.g., by another design team). Otherwise, concurrency reduction may introduce deadlocks in the composition of the circuit and the environment, e.g., if the environment expects b after a and the circuit provides b before a as a result of two conflicting concurrency reductions for initially concurrent events a and b.
Definition 5.1 (Valid reduction)
If a reduced SG satisfies all properties (1)-(4) above, then the concurrency reduction is valid.
The algorithm sketched in Figure 7 defines our basic operation for concurrency reduction, called forward reduction. It takes two concurrent events as parameters. Concurrency is reduced for the first event (a). The second event (b) defines the set of states ER(a) ∩ ER(b) in which concurrency for a should (at least) be reduced in one step. In the simplest case, when events enabled in ER(a) are persistent, and ER(a) has only one minimal state (a state is minimal in an ER if it has no predecessors in the ER), FwdRed(a, b) creates an arc from event b to event a at the STG level.
The application of the forward concurrency reduction FwdRed to an STG with choice (non-persistency) and concurrency is illustrated in Figure 8. The reduced SG corresponds to an STG with no concurrency between (a, b), (a, e), and (a, d). Hence, in general reducing concurrency for a pair of events can also reduce concurrency for some other pairs. Note that in lines 1 and 2 of FwdRed, states are removed from the ER of event a, not from the SG. I.e., at this step only arcs labeled with a can be removed from the SG.
The following proposition shows that iterative application of FwdRed to an SG results in a valid concurrency reduction.
The basic operation: forward reduction
[Figure 7: algorithm FwdRed. Recoverable fragments: remove unreachable states and their output arcs; 4: if there exists some e such that ER(e) = ∅ or ...]

Therefore, our practical implementation described in the next section is restricted to the application of FwdRed.
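A forward reduction step on an explicit state graph can be sketched in Python as follows. This is a simplified illustration and not the algorithm of Figure 7: it only removes the arcs labeled a that leave states of ER(a) ∩ ER(b), prunes unreachable states, and rejects the result when some event disappears. The output-persistency check and the handling of ERs with several minimal states are omitted, and all function names are hypothetical.

```python
def excitation_region(arcs, event):
    """States in which event is enabled."""
    return {source for source, label, _target in arcs if label == event}

def reachable(arcs, initial):
    """States reachable from the initial state."""
    seen, stack = {initial}, [initial]
    while stack:
        state = stack.pop()
        for source, _label, target in arcs:
            if source == state and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen

def fwd_red(arcs, initial, a, b):
    """Delay a until b has fired; return the reduced arcs, or None if invalid."""
    events = {label for _source, label, _target in arcs}
    overlap = excitation_region(arcs, a) & excitation_region(arcs, b)
    # Only arcs labeled a are removed at this step.
    kept = {arc for arc in arcs if not (arc[1] == a and arc[0] in overlap)}
    # Remove unreachable states together with their output arcs.
    live = reachable(kept, initial)
    kept = {arc for arc in kept if arc[0] in live}
    # Reject the reduction if some event can no longer fire.
    if {label for _source, label, _target in kept} != events:
        return None
    return kept

diamond = {("s0", "a", "s1"), ("s0", "b", "s2"),
           ("s1", "b", "s3"), ("s2", "a", "s3")}
print(sorted(fwd_red(diamond, "s0", "a", "b")))
```

On the concurrency diamond the result keeps only the path in which b fires before a, which corresponds to adding a causal arc from b to a at the STG level.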
Implementation
As we mentioned in Section 3, concurrency reduction can reduce the logic complexity of the circuit in two ways. First of all, the number of CSC conflicts is reduced, and hence the complexity of the logic implementing the state signals is reduced. Secondly, the number of reachable states is reduced, and hence the don't care set for logic minimization is increased. However, in case one signal becomes ordered with another, the support of its boolean function increases. For this reason, we use a heuristic cost function that estimates changes in logic complexity at each step, since exact computation by state signal insertion, decomposition and technology mapping would be too expensive.
The algorithm in Figure 9 describes how concurrency reduction is performed. The designer initially provides a list of pairs of events whose concurrency cannot be reduced, e.g., because they are crucial for overall system performance. This will prevent the algorithm from adding causality relations between these pairs of events.
The exploration is done by a strategy similar to the alpha-beta pruning commonly used in game-playing algorithms. At each level of the exploration from a given configuration, a set of neighbor configurations is generated by performing a basic transformation (forward concurrency reduction between two events). For each level of the exploration, only a few candidates, with the best estimated cost, survive to the next level. These candidates are kept in the list frontier. The width of the exploration is controlled by the parameter size_frontier.
Note that at each level of the exploration the obtained state graphs are less concurrent than their predecessors. This monotonic behavior guarantees that the algorithm will terminate when no more concurrency can be reduced in the current search space.
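The level-by-level exploration with a bounded frontier can be sketched in Python as follows. This is an illustrative toy, not the procedure of Figure 9. A configuration is modeled as the set of event pairs that are still concurrent, a neighbor is obtained by ordering one unprotected pair, and the cost is a made-up additive weight. WEIGHTS, PROTECTED and all function names are hypothetical.

```python
# Hypothetical cost contribution of keeping each pair concurrent.
WEIGHTS = {("a", "b"): 5, ("a", "c"): -2, ("b", "c"): 3}
# Pairs whose concurrency the designer forbids reducing.
PROTECTED = {("b", "c")}

def neighbors(config):
    """Configurations obtained by ordering one unprotected concurrent pair."""
    return [config - {pair} for pair in config if pair not in PROTECTED]

def cost(config):
    """Toy estimate of logic complexity (lower is better)."""
    return sum(WEIGHTS[pair] for pair in config)

def explore(initial, neighbors, cost, size_frontier):
    """Keep the size_frontier cheapest candidates at each level."""
    best = initial
    frontier = [initial]
    while frontier:
        candidates = {n for config in frontier for n in neighbors(config)}
        # Ties are broken on the sorted pairs to keep the run deterministic.
        ranked = sorted(candidates, key=lambda c: (cost(c), sorted(c)))
        frontier = ranked[:size_frontier]
        for config in frontier:
            if cost(config) < cost(best):
                best = config
    return best

start = frozenset(WEIGHTS)
print(sorted(explore(start, neighbors, cost, size_frontier=1)))
```

Every neighbor has strictly fewer concurrent pairs than its parent, so the frontier eventually becomes empty and the loop terminates, mirroring the monotonicity argument in the text.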
The cost function to select the best configurations at each level aims at reducing the complexity of the resulting circuit. Unfortunately, the estimation of the complexity of the logic for output signals with CSC conflicts can be inaccurate due to the impossibility to derive correct equations. 
Experimental results
The techniques presented in this paper have been implemented in the tool petrify [4]. After handshake expansion and concurrency reduction, circuits have been derived by using previously published synthesis techniques for speed-independent circuits. The final area was obtained by decomposing the circuit into 2-input gates and mapping the network onto a gate library. The decomposition was performed by preserving the speed-independence of the circuit. Our tool can automatically perform a 4-phase expansion by using the structural techniques discussed in Section 4, and derive the specification shown in Figure 10.b. After this transformation, the return-to-zero signalling is performed with maximum concurrency. However, a direct implementation of this behavior would result in a complex circuit due to the need of inserting extra logic for state encoding and logic decomposition (twice as complex as Figure 10.e).
Figures 10.d and 10.e depict the solution automatically obtained by reducing the concurrency of the 4-phase refinement in Figure 10.b. The reduction has been performed by preserving the concurrency between the events b? and c?, thus maintaining the parallel execution of both processes. Interestingly, the circuit manifests an asymmetric behavior that can be beneficial to implement PAR components in which the process at channel b is known to be slower than that at c. The circuit is slightly smaller (by 12% in our standard cell library) than the known manual design. However, its estimated performance may be worse than that of Figure 10.

Second case study: the MMU controller. In [8] it was shown that by using timing assumptions on the behavior of the environment, it is possible to reduce the area of an asynchronous Memory Management Unit control circuit by over 50%, with respect to the original speed-independent implementation. Our experiments presented in Table 2 show that approximately the same area improvement can be reached without sacrificing speed-independence, if we are allowed to use flexibility in playing with concurrency of the reset transitions of the four-phase protocol. A combination of our high-level transformation and Myers' lower level timing optimizations can conceivably provide even better optimization results.
We can conclude that:
• With respect to the original solution, reshuffling can yield an area reduction to less than one half.
Conclusions
Specifying the behavior of an asynchronous system is a complex task that needs to be performed at the appropriate high level of abstraction. Reasoning in terms of actions (or events) and communication channels allows the designer to describe a behavior without worrying about the implementation details.
This paper has presented a method to automate the decisions taken at the lowest levels of circuit synthesis, concerning phase refinements and event reshuffling. Thus the designer is only left with the task of defining the causality among actions and specifying the desired concurrency in the system. The task of translating actions into signal transitions is automatically handled by CAD tools.
Some aspects still require further research. In particular, better logic estimation strategies when the specification has CSC conflicts must be sought. On the other hand, simple but accurate methods for performance estimation should be devised to increase the degree of automation and provide a wider exploration of the solution space.
