IBM T.J. Watson Research Center
Yorktown Heights
Abstract
We offer a technique to partition a centralized control-flow graph to obtain distributed control in the context of asynchronous high-level synthesis. The technique targets Huffman-style asynchronous controllers that are customized to the problem. It solves the key problem of handling signals that are shared between the partitions, a problem due to the incompletely specified nature of asynchronous controllers. We report encouraging experimental results on realistic examples.
Introduction
Asynchronous circuits are receiving considerable attention of late due to their promise in many areas, including performance and energy consumption. A central problem in asynchronous high-level synthesis is that of partitioning a centralized control-flow graph to obtain distributed controllers. A centralized controller can often be more complex (in terms of logic) than a collection of distributed controllers, and can have slower signal paths through it. Centralized controllers can also result in increased wire lengths and involve timing assumptions of a global nature. In this paper, we present the automated control partitioning algorithm incorporated in ACK, our high-level synthesis tool for asynchronous circuits, which accepts a subset of high-level Petri-nets as input and generates partitioned two-phase controllers and the associated data-path as output. A Verilog front-end is also available for ACK.
One way to obtain distributed control circuit realizations is by employing macromodules [15]. However, most macromodule libraries contain only a limited number of macromodule types, and hence distributed control realizations based on macromodules are often inefficient [6]. A class of controllers called burst-mode controllers, which are potentially more efficient than macromodules and can be customized [4, 11, 17], have been proposed and widely used in a number of non-trivial designs. However, burst-mode synthesis procedures cannot handle designs beyond a certain input/output (I/O) size, due to the complexity of many of the global optimizations used. Hence, in previous designs where this I/O size was exceeded, burst-mode controllers were manually partitioned, largely depending on the designer's intuitions. This procedure is very tedious and results in burst-mode controller descriptions that are incomprehensible and hard to verify. Moreover, even when the burst-mode synthesis of centralized controllers with large I/O sets is possible, the resulting controllers can be inefficient, as pointed out above. Therefore, in an automated high-level synthesis environment such as ACK, where large control graphs can be generated from a user's high-level HDL description of non-trivial designs, an automated controller partitioning method is essential. This is the problem addressed in this paper.
A key problem in partitioning stems from the fact that asynchronous controllers are, in general, incompletely specified. More specifically, the steps of critical-race-free state assignment and hazard-free logic minimization in burst-mode synthesis rely on the fact that the environment of the controller does not present any of the unspecified behaviors. Under this assumption, the sharing of signals between the partitions is a non-trivial problem. For example, suppose an input signal is shared between a collection of partitions. When the environment generates a change on this signal, to which of these partitions must the change be sent? We provide a method to address this issue.
*This research was done when the first author was a graduate student at the University of Utah and was supported in part by a University of Utah Research Fellowship.
*Supported in part by NSF Award MIP 9215878.
33rd Design Automation Conference
Department of Computer Science
University of Utah
Related Work
In [10], a technique called process decomposition is proposed. Process decomposition does not involve signal-sharing between incompletely specified machines. Signal-sharing is addressed in macromodule-based design systems [1, 2] by using additional macromodules such as Toggles [15] and Decision-waits [5].
Overview of ACK
A design entered in ACK is a Petri-net description organized as a collection of sequential processes communicating through CSP-style channels. Each transition of the Petri-net (except fork and join) is annotated with an action, which can be a two-phase [15] signal transition on an input or output wire (input transition names are underlined), an assignment statement, a Boolean expression (used for choices), or a CSP-style communication primitive. Fork/join concurrency is allowed within sequential threads. The fork and join transitions are labeled by an "ε" action denoting a no-op. Synthesis in ACK proceeds by first allocating the requisite data-path resources, which include library/Viewlogic-synthesized operators for computation actions, C-elements for channel communication actions, select elements for data-dependent choices, and library registers for storage. The underlying control graph is then obtained by refining each high-level Petri-net action into two-phase handshake actions, using standard approaches [1, 16]. The end result is one centralized control graph per sequential process.
The partitioned synthesis problem addressed in this paper is: given a centralized control graph and a set of partitions on it chosen [...] The identification of the partitions is not addressed here, though a few automatable heuristics, such as keeping logically unrelated iterative loops that share signals in separate partitions, usually yield good results in terms of increased performance and reduced logic complexity.
Note that some places in the Petri-net in Figure 1 have been omitted due to paucity of space.
Partitioning Algorithm
Centralized control graphs, which form the input to the partitioning phase of ACK, are "state machines [13] with fork/joins" (SFJ), that is, single-threaded graphs with fork/join concurrency. SFJ graphs are triples $G = (P, T, F)$ where $P = \{p_1, p_2, \ldots, p_n\}$ is a finite set of places, $T = \{t_1, t_2, \ldots, t_m\}$ is a finite set of transitions, and $F \subseteq (P \times T) \cup (T \times P)$ is a flow relation. $T$ consists of fork transitions ($T_f$), join transitions ($T_j$), and sequential transitions ($T_s$), which have an in-degree and out-degree of one. For each $t \in T_f$, there is exactly one $t' \in T_j$ (and vice versa) such that the out-degree of $t$, $N$, is the same as the in-degree of $t'$. The $i$th output place of fork transition $t$ and the $i$th input place of the corresponding join transition $t'$ are, respectively, the (only) input place and (only) output place of a single-threaded subgraph (STS) that models "the $i$th thread" of the fork/join. An STS $G_{st} = (P_{st}, T_{st}, F_{st})$ is a subgraph of $G$ where $P_{st} \subseteq P$, $T_{st} \subseteq T_s$, and $F_{st} \subseteq F$ is the flow relation restricted to $P_{st}$ and $T_{st}$. An STS must be disjoint from all other STSs (not share any place or transition). $T_{st}$ should not contain a fork or a join (but may contain choices). The unique entry place and exit place of an STS are called its input place and output place, respectively.
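The structural conditions on an SFJ graph can be checked mechanically. The following is a minimal Python sketch, not the representation used inside ACK: the class name `SFJGraph`, its field layout, and the small example graph are all assumptions introduced here for illustration. It checks that the flow relation only connects places with transitions, that sequential transitions have in-degree and out-degree one, and that each fork's out-degree matches the in-degree of its join.

```python
from dataclasses import dataclass


@dataclass
class SFJGraph:
    places: set        # P
    transitions: dict  # T: transition name -> "fork", "join", or "seq"
    flow: set          # F: set of (source, target) arcs

    def flow_is_bipartite(self):
        # Every arc must lie in (P x T) or (T x P).
        return all(
            (s in self.places and d in self.transitions)
            or (s in self.transitions and d in self.places)
            for s, d in self.flow
        )

    def in_degree(self, t):
        return sum(1 for _, d in self.flow if d == t)

    def out_degree(self, t):
        return sum(1 for s, _ in self.flow if s == t)

    def degrees_ok(self, fork_to_join):
        # Sequential transitions have exactly one input and one output place.
        for t, kind in self.transitions.items():
            if kind == "seq" and (self.in_degree(t), self.out_degree(t)) != (1, 1):
                return False
        # The out-degree N of each fork equals the in-degree of its join.
        return all(
            self.out_degree(f) == self.in_degree(j)
            for f, j in fork_to_join.items()
        )


# Hypothetical example: one fork/join with two single-transition threads.
g = SFJGraph(
    places={"p0", "pa", "pb", "qa", "qb", "p1"},
    transitions={"f": "fork", "j": "join", "ta": "seq", "tb": "seq", "t1": "seq"},
    flow={
        ("p0", "f"), ("f", "pa"), ("f", "pb"),
        ("pa", "ta"), ("ta", "qa"), ("pb", "tb"), ("tb", "qb"),
        ("qa", "j"), ("qb", "j"), ("j", "p1"),
        ("p1", "t1"), ("t1", "p0"),
    },
)
```

Here `g.degrees_ok({"f": "j"})` pairs the fork `f` with the join `j`; the pairing is supplied by the caller rather than inferred from the graph.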
Each transition in $T_s$ is annotated by a non-empty burst of two-phase input or output (but not both) signal transitions. Transitions in $T_f$ and $T_j$ are labeled by ε. To simplify things, we assume that no two bursts labeling transitions contained in two different STS graphs of the same fork/join involve the same wire name. Also, in order to generate legal burst-mode machines [11] from the partitions through burst-mode reduction [6], the original SFJ graphs must obey the following restrictions: they must (1) be initially quiescent and attain quiescence infinitely often (a quiescent state is one where no output must be produced before consuming at least one input); (2) be deterministic; (3) be delay insensitive; and (4) obey the subset property [11].
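The wire-name assumption on the threads of a fork/join is easy to test before partitioning. The sketch below is illustrative only: the function name `shared_wires` and the encoding of a thread as a list of bursts (each burst a set of wire names) are assumptions made here, not part of ACK.

```python
from itertools import combinations


def shared_wires(threads):
    """Return the wire names used by more than one thread of a fork/join.

    `threads` is a list of threads; each thread is a list of bursts, and
    each burst is a set of wire names (an assumed encoding).
    """
    # Collect all wire names appearing anywhere in each thread.
    wires_per_thread = [set().union(*bursts) for bursts in threads]
    shared = set()
    for a, b in combinations(wires_per_thread, 2):
        shared |= a & b
    return shared


# Hypothetical threads: the first pair is disjoint, the second shares "a".
ok_threads = [[{"a", "b"}, {"c"}], [{"d"}]]
bad_threads = [[{"a"}], [{"a", "x"}]]
```

An empty result means the simplifying assumption holds for that fork/join; a non-empty result lists the wires that would violate it.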
At the end of partitioning, the goal is to generate sequential machines where each transition is annotated with input and output bursts. It has been shown [6] that such a partitioned machine $P$ can be converted into a burst-mode machine $C$ that has, as its interface traces, the set of traces generated by $P$ when operated in the fundamental mode ($P$ is allowed to attain quiescence after each set of inputs to it). Thus, the real proof obligation of our partitioning procedure (to be described) is to ensure that the interface traces for a collection of burst-mode controllers implementing the partitions of a well-formed SFJ graph are the same as those for a centralized burst-mode controller implementing the same SFJ graph.
A partition of an SFJ graph is either any of the STS subgraphs of a fork/join (these are called required partitions) or one of the STSs obtained by the following procedure applied to each of the fork/join pairs $(t, t')$: (1) remove the fork/join pair and all the STSs subtended by them; (2) assign the input place of $t$ as the output place of the STS preceding $t$; (3) assign the output place of $t'$ as the input place of the STS following $t'$. Any of the STSs obtained above can be further partitioned into its constituent STSs. The set of input and output places of the partitions of the given SFJ graph are called partitioning places (PP). In Figure 1, $I_k$ and $O_k$ (for $k \in 1 \ldots 3$) are the input and output places of the three partitions, and form the partitioning places. Additional partitioning places may be chosen from within the partition P3 (though we do not do so in our example). We do not consider STSs with an empty set of transitions as partitions.
Partitioned Controller Synthesis
Each partition is supported by its own controller initialized to its own initial state. The partition controllers also incorporate provisions to hand over control to other partition controllers. We simplify our initial exposition by (1) considering all PPs to be either the input(s) or output(s) of forks and joins, and (2) assuming that signals are not shared between partitions. These assumptions will be relaxed momentarily. Consider the controller $C_i$ supporting the $i$th required partition of a fork/join. Proper control hand-over between $C_i$ and the controller for the partition preceding the fork, $C^-$, is arranged by making the very first transition processed by $C_i$ from its start state be {done}, where done is the last signal transition generated by $C^-$ before it goes back to its start state, where it in turn waits for a {done} signal from another partition telling it to resume execution. This ensures that whenever controller $C^-$ finishes its execution, the $C_i$ are all started. The remaining actions of $C_i$ are the same as the actions present in the STS graph of the $i$th required partition. Proper control hand-over between $C_i$ and the controller for the partition following the join, $C^+$, is arranged by making the very last transition processed by $C_i$ before it goes back to its start state be {done$_i$}. The very first transition processed by $C^+$ out of its start state, then, is {done$_i$}, $i \in 1 \ldots K$, where $K$ is the arity of the fork/join. This ensures that whenever all controllers $C_i$ finish, $C^+$ takes over. Note that $C^-$ and $C^+$ may be the same partition.
Now, relax assumption (1) above and consider PPs that are not associated with fork/joins. Every such PP demarcates two partitions with their own supporting controllers starting in their own initial states. The PP serves as a "merge" place for the threads preceding it and as a "choice" place for the threads following it. Observe that these threads are sequential with respect to each other.
Consider the modifications that must be made to the controller that supports the partition preceding the PP. For this partition, we add the output burst {done} as the last action of this controller before it goes back to its start state. Similarly, consider the controller supporting the partition following the PP; we add the input burst {done} as the very first action of this controller from its start state. This ensures that whichever way the merge place is entered, the choice place is enabled. The controller then proceeds to carry out the remaining actions of the partition following the PP. Figure 2 illustrates these ideas.
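The fork/join hand-over described above amounts to wrapping each partition's burst sequence with done bursts. The following Python sketch shows this under simplifying assumptions that are ours, not the paper's: each controller is a flat list of `(direction, burst)` pairs with no choices, the preceding and following partitions are distinct, and the signal names `done`, `done1`, `done2`, ... are illustrative.

```python
def add_fork_join_handover(pre, threads, post):
    """Add done bursts around a fork/join (illustrative encoding).

    pre, post: burst lists of the partitions before the fork / after the join.
    threads:   list of burst lists, one per required partition.
    Each burst is a (direction, frozenset_of_signal_names) pair.
    """
    k = len(threads)  # arity K of the fork/join
    # The preceding controller emits {done} as its last action.
    pre_c = list(pre) + [("out", frozenset({"done"}))]
    # Each thread controller waits for {done} and finishes with {done_i}.
    thread_cs = [
        [("in", frozenset({"done"}))]
        + list(th)
        + [("out", frozenset({f"done{i}"}))]
        for i, th in enumerate(threads, start=1)
    ]
    # The following controller waits for all done_i in one input burst.
    post_c = [("in", frozenset(f"done{i}" for i in range(1, k + 1)))] + list(post)
    return pre_c, thread_cs, post_c


# Hypothetical partitions with made-up wire names.
pre_c, thread_cs, post_c = add_fork_join_handover(
    pre=[("in", frozenset({"req"}))],
    threads=[[("out", frozenset({"x"}))], [("out", frozenset({"y"}))]],
    post=[("out", frozenset({"ack"}))],
)
```

Collecting all {done$_i$} signals into a single input burst of the following controller reflects the requirement that it takes over only once every thread controller has finished.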
Such a distributed control realization of an SFJ graph (as described above) manifests exactly the same interface traces as a centralized controller realization of the same SFJ graph when the distributed control realization is operated in the fundamental mode. This is because whenever control is handed over through the "done" signals, the done signals generated by the preceding partition(s) are (all) absorbed by the following partition(s) before the environment is allowed to send any new inputs to the following partition. Thus, as far as the environment is concerned, the right set of partitions becomes active at the right times. The rest of this paper concerns itself with relaxing assumption (2) above.
Synthesize Sharing Arrangements
Define the input set $Inp(K) \subseteq W$ of a partition $K$ to be the set of input wires to which $K$ is sensitive, meaning these inputs make a transition somewhere within partition $K$. Define $Out(K)$ similarly. If a partition is sensitive to a set of input and output wires disjoint from that of all other partitions, it can be directly implemented. For an output wire $o$ that is shared between partitions $K_1$ and $K_2$, we rename $o$ to $o_1$ in $K_1$ and to $o_2$ in $K_2$, and synthesize the resulting controllers using our method. The outputs $o_1$ and $o_2$ are then merged using an XOR gate to produce the output signal $o$. In Figure 2, output d is generated in this manner. This method works because of the two-phase nature of the control signals, and because any two transitions of the renamed output signals (the two inputs of the XOR) are guaranteed to be sequentially ordered.
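The XOR merge of renamed two-phase outputs can be illustrated with a small simulation. This is a sketch under stated assumptions: the function name `xor_merge` and the event encoding (a sequence of partition names, each naming the partition that toggles its renamed output next) are introduced here for illustration, and the events are assumed to be sequentially ordered as the text requires.

```python
def xor_merge(events, initial=0):
    """Simulate o = o1 XOR o2 for sequentially ordered two-phase events.

    `events` lists, in order, which partition ("K1" or "K2") toggles its
    renamed output. Returns the level of o after each event.
    """
    level = {"K1": initial, "K2": initial}
    trace = []
    for who in events:
        level[who] ^= 1                       # two-phase: each event is a toggle
        trace.append(level["K1"] ^ level["K2"])
    return trace


# Hypothetical ordering of output events across the two partitions.
merged = xor_merge(["K1", "K1", "K2", "K1"])
```

For this ordering, `merged` is `[1, 0, 1, 0]`: every toggle of either renamed output produces exactly one transition on the merged signal, regardless of which partition produced it.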
The difficulty in handling input sharing is that we must ensure that only the "right partition" sees the input transition in each state. In Figure 2, the first and the second occurrences of input a are seen by partition P1, while the third occurrence must be seen by partition P3 if the choice is resolved through e, and by partition P1 if the choice is resolved through f. As with shared outputs, the first step is to rename the shared inputs within each partition. An input-translator state machine is then derived that translates the input signal from the environment into these renamed signals, local to the component machines, at the right times. The controller Glue1 in Figure 2 achieves this for signal a in our example.
For the algorithm to generate input translators, we assume that any input signal that resolves a choice (appears in the burst labeling the transition immediately following a choice place) occurs in no other partition than the partition that contains the choice. A solution that relaxes this assumption exists [8], and its proof is in progress.
The steps in obtaining input translators are as follows: (1) [...] In our case, we retain partition P1. (3) In the resulting graph, following each occurrence of the transition of the shared signal $i$, introduce a corresponding output $o_k$. In our example, following the two occurrences of input a falling in partition P1, we generate output a1, whereas following the occurrence of input a in P3, we generate output a3. These, then, are internal signal transitions that get sent to the right partition at the right time.
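The behavior expected of an input translator can be sketched as a table-driven state machine. The table below is an assumed stand-in that is merely consistent with the description of the example (two occurrences of a routed to P1, then a third routed to P3 or P1 depending on whether the choice is resolved through e or f); it is not the Glue1 controller of Figure 2, and the state names `s0` to `s4` and the function `translate` are hypothetical.

```python
# Hypothetical translator table: (state, input) -> (next state, renamed output).
# A renamed output of None means the input is only observed, not forwarded.
TRANSLATOR = {
    ("s0", "a"): ("s1", "a1"),  # first occurrence of a goes to P1
    ("s1", "a"): ("s2", "a1"),  # second occurrence of a goes to P1
    ("s2", "e"): ("s3", None),  # choice resolved through e
    ("s2", "f"): ("s4", None),  # choice resolved through f
    ("s3", "a"): ("s0", "a3"),  # third occurrence goes to P3
    ("s4", "a"): ("s0", "a1"),  # third occurrence goes to P1
}


def translate(inputs, table=TRANSLATOR, start="s0"):
    """Map environment input events to renamed, partition-local events."""
    state = start
    outputs = []
    for sig in inputs:
        state, out = table[(state, sig)]
        if out is not None:
            outputs.append(out)
    return outputs


via_e = translate(["a", "a", "e", "a"])
via_f = translate(["a", "a", "f", "a"])
```

In this sketch `via_e` is `["a1", "a1", "a3"]` and `via_f` is `["a1", "a1", "a1"]`, so each environment transition on a reaches exactly one partition.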
Synthesize Final Circuits
Each of the controller descriptions can now be synthesized into asynchronous burst-mode circuits following the procedure described in [6]. In order to obtain a burst-mode controller specification from a two-phase controller specification, we need to know the initial input signal values for each of the controllers.
The initial values of all external input signals are specified by the user. All the internal signals that are introduced during partitioning can be initialized to any value due to the use of the two-phase protocol (we initialize them to 0). The resulting description can be synthesized into a burst-mode machine which can then be synthesized using [...]
Results and Conclusions
We have conducted comparisons between centralized and partitioned controllers on a large number of examples, some of which are shown in Table 1. Apart from making it possible to synthesize larger designs, partitioning can also decrease synthesis time by several orders of magnitude. Partitioning also often significantly decreases the number of literals in the synthesized design and increases the overall controller performance compared to that of a centralized implementation.
In Table 1 we show the partitioning results for a CD Player Error Detector from [7], a barcode reader from the High Level Synthesis Design benchmarks [12] adapted to asynchronous operation, a loop example, a factorial computation unit, and an iterative implementation of the GCD algorithm. For the CD Player Error Detector and the Barcode Reader, the synthesis of the centralized controllers did not complete due to the complexity of the synthesis task. The results for these are marked with n.a. In the table, the #BM trans column is a measure of controller complexity and shows the number of burst-mode transitions in the specification of the controller. For the examples where the centralized controllers finished synthesis, a layout was generated from a two-level standard gate implementation, and the performance of the centralized and partitioned controllers was measured. The comparison showed an average performance increase for the partitioned controllers of between 10% and 30%. Note that this comparison only exploited performance advantages due to temporal locality. Partitioning also gives us the possibility to take advantage of spatial locality, which, as feature sizes get smaller and wire delays become significant, is an important factor for high-performance designs.
In this paper, we have presented a method to deal with the partitioning of asynchronous controllers. This work specifically provides a partitioning method in the context of asynchronous high level synthesis methods that target state machine controllers, although the basic ideas can be extended to other asynchronous partitioning problems. 
