Abstract-This paper presents a technology mapper that optimizes the average performance of asynchronous burst-mode control circuits. More specifically, the mapper can be directed to minimize either the average latency or the average cycle time of the circuit. The input to the mapper is a burst-mode specification and its NAND-decomposed unmapped network. The mapper pre-processes the circuit's specification using stochastic techniques to determine the relative frequency of occurrence of each state transition. Then, it maps the NAND-decomposed network using a given library of gates. Of many possible mappings, the mapper selects a solution that minimizes the sum of the delays (latency or cycle time) of all state transitions, weighted by their relative frequencies, thereby optimizing for average performance. We present experimental results on a large set of benchmark circuits, which demonstrate that our mapped circuits have significantly lower average latency and cycle time than comparable circuits mapped with a leading conventional mapping technique which minimizes the worst-case delay. Moreover, these performance improvements can be achieved with manageable run-times and significantly smaller area.
I. INTRODUCTION
Asynchronous circuits are event-driven. That is, they can respond immediately to the completion of a computation without waiting for a clock edge. Since the computation delays are, in general, data-dependent, the performance of asynchronous circuits should be characterized by their average-case behavior, not their worst-case behavior, which is the gauge used for synchronous circuits. Therefore, the performance goal of technology mapping techniques for asynchronous circuits should be to maximize the average-case speed, unlike synchronous circuits, which are mapped to minimize the worst-case delays [7], [13], [18], [28], [33], [34].
Furthermore, most existing technology mappers for asynchronous circuits focus on ensuring hazard-freedom [17], [21], [31], [30] without regard to performance. One reason for this apparent lack of research is that an explicit technology mapping step is not needed in all asynchronous circuit design methodologies. Specifically, many asynchronous design methodologies rely on the existence of specialized cell libraries which make the technology mapping step trivial [10], [19], [32], [4]. For control-oriented synthesis methodologies, such as burst-mode designs, however, technology mapping using existing standard-cell libraries is often an important part of the design process. In fact, the motivation for this work arises from the Asynchronous Instruction Length Decoder Project at Intel Corporation [26]. The lack of tools resulted in manual mapping of control circuits using back-of-the-envelope techniques, which was both labor-intensive and sub-optimal. This paper proposes the first known technique to optimize the average-case performance of a popular form of asynchronous controllers called burst-mode control circuits [22]. The control circuits are specified in extended burst-mode (XBM) [39] and implemented using a modified Huffman architecture [22], [36], [39]. Each circuit consists of a combinational logic block with some outputs fed back to the inputs through delay elements. The XBM specification and an implementation of the SCSI-INIT-SEND controller are shown in Fig. 1 as an example.
In each state, the circuit waits for a set of specified input transitions, referred to as an input burst; upon detecting specified input transitions, it toggles a set of output signals, referred to as an output burst. In addition, depending on the implementation style, the circuit may internally toggle a number of state signals either concurrently with output signals or in a separate state burst before or after the output burst [36] . For the machine to operate properly, the environment must obey the generalized fundamental-mode constraint which essentially states that the next input burst must arrive after the fed-back signals have propagated deeply enough into the combinational logic block, in order to ensure that there are no unexpected hazards [6] , [36] .
There are two parameters often used to characterize the performance of burst-mode circuits: latency and cycle time. The latency of a state transition is the maximum delay from the transitions on primary inputs to the transitions on primary outputs. If the state signals toggle concurrently with outputs or after outputs, the latency is determined by the delay of the combinational logic. If the state signals toggle before the outputs, however, the latency includes the delay through the feedback delay elements. The cycle time of a state transition is the minimum delay necessary between the end of the input burst and the beginning of the input burst of the next state transition. Note that the cycle time of a state transition is the sum of its latency and the associated fundamental-mode constraint.
For many applications, the optimization of latency is more important than that of cycle time. Consider a situation in which controllers are responsible for controlling the sequence of various datapath operations [38]. In these applications, the environmental response time corresponds to the delay of the datapath units and thus is relatively long. Consequently, the fundamental-mode constraint is typically easily met and the system performance is determined by the sum of the datapath and controller latencies. Thus, in these applications, the burst-mode cycle time does not adversely affect system performance.
In control-dominated applications, on the other hand, the environmental response time of a controller may just be the latency through another controller and thus can be small. In those cases, the fundamental-mode constraints are harder to meet, and, in some cases, delays must be added to increase the environmental response time. Ideally, these delays should be placed inside the communicating controller such that they delay only the response time of those state transitions that violate their fundamental-mode constraints. Otherwise, it may be necessary to add the delays to the output signals to meet the worst-case fundamental-mode constraint, which unfortunately slows down the environment's response time uniformly for every state transition. In this paper, we assume that the environment is ideal and thus our goal is to minimize the controller's average cycle time. Extensions to minimize the worst-case fundamental-mode constraint are straightforward but not discussed here.
Because an output signal is often specified to change in multiple state transitions, its logic may contain different paths that are excited in different state transitions. To optimize for average performance, the delay along these paths should be prioritized according to the relative frequencies of the associated state transitions. In order to achieve this, we assume that conditional probabilities of all state transitions are either provided by the user or estimated through a behavioral simulation of the circuit in its environment. In the SCSI-INIT-SEND circuit, the conditional probabilities depend on the packet size of SCSI transfers. For example, using a reasonable packet size of 1K bytes, the conditional probabilities from state 3 to state 4 and to state 6 would be 0.999 and 0.001 respectively. Our mapper pre-processes these conditional probabilities, using a Markov chain analysis, to obtain the relative priority of each state transition.
Our technology mapper maps only the combinational logic portion of the controller. However, the mapping choices, and thus the structure of the combinational logic, affect the amount of feedback delays required and the fundamental-mode constraint. To account for this, we use a combination of techniques used in [6] and event-driven simulation to estimate the required feedback delay and fundamental-mode constraint during the mapping procedure. Consequently, our mapper can optimize both the latency and the cycle time (the choice is user-specified).
Our mapper performs two steps: decomposition and covering. In the decomposition step, optimized circuit equations are decomposed into a set of available base functions, such as 2-input NAND's and inverters. This decomposition is necessary to ensure that every node of the network can be implemented by at least one library gate. It also increases the granularity of the network, which provides more mapping options, yielding better final circuits [28] . A unique feature of our decomposition algorithm is that it incorporates an efficient heuristic that optimizes the decomposition for average performance. In the covering step, the optimized decomposed network is covered with available library gates with the objective of optimizing average performance under a given area constraint.
Our covering algorithm is motivated by Chaudhary and Pedram's covering algorithm [7], which evaluates area-delay tradeoffs based on static (pessimistic) timing analysis. Novel aspects of our algorithm are as follows.
Generation of a multi-dimensional area-performance tradeoff surface that extends the notion of a two-dimensional area-delay tradeoff curve proposed in [7].
Adaptation of an input-pattern-dependent method to consider delays of only true critical paths associated with each state transition [15] , thereby avoiding the false path problem.
Inclusion of a new heuristic to remove likely-non-optimal mappings from consideration to reduce both CPU run-times and memory usage.
We tested our algorithm on a large set of benchmark circuits. To evaluate the benefit of our algorithm, we compared the average performance of our mapped circuits to the average performance of comparable circuits mapped by an algorithm that minimizes the worst-case delay. The results indicate that our circuits have significantly better average performance and are, on average, significantly smaller, with comparable CPU run-times.
The remainder of the paper is organized as follows. Section II provides the necessary background on technology mapping, burst-mode circuits, and timing analysis of synchronous circuits. Section III formalizes our notions of average latency and cycle time for burst-mode circuits. Sections IV and V describe our average-case decomposition step and covering step, respectively. Section VI reports our experimental results and Section VII presents our conclusions.
II. BACKGROUND
In this section, we describe previous work on technology mapping for both synchronous and asynchronous circuits, review burst-mode circuits, and review worst-case timing analysis of synchronous circuits.
A. Previous work on technology mapping

A.1 Technology mapping of synchronous circuits
For synchronous circuits, technology mapping is often reduced to a problem of covering a directed acyclic graph (DAG) which is then approximated by a sequence of optimal tree coverings [13] , [28] . First, a set of base functions is chosen, such as a 2-input NAND gate and an INVERTER. Second, the set of equations for the target circuit (typically obtained from a technology independent optimization) are decomposed into a graph, in which each node is a base function. Last, the decomposed graph is covered using available library gates so that a selected cost function is minimized. Initially, algorithms were developed to minimize the area [13] and the worst-case delay [28] of the mapped circuit. Later, both Chaudhary and Pedram [7] and Touati et al. [33] proposed more sophisticated algorithms to obtain the covering that yields the minimum area under given delay constraints.
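The covering step described above can be illustrated with a small dynamic-programming sketch. It is a minimal, assumed example of minimum-area tree covering in the spirit of [13], not the mapper proposed in this paper: the tuple encoding of the subject tree, the four-gate library, and the area values are all hypothetical, and input permutations of a pattern are not tried.

```python
from functools import lru_cache

# Subject tree over base functions: ("in", name), ("inv", x), ("nand", x, y).
# Hypothetical library: (gate name, area, pattern); "?" marks a pattern leaf.
LIBRARY = [
    ("INV",   1.0, ("inv", "?")),
    ("NAND2", 2.0, ("nand", "?", "?")),
    ("AND2",  2.5, ("inv", ("nand", "?", "?"))),
    ("OR2",   2.5, ("nand", ("inv", "?"), ("inv", "?"))),
]

def match(pattern, node):
    """Return the subject subtrees bound to the pattern leaves, or None."""
    if pattern == "?":
        return [node]
    if node[0] != pattern[0] or len(node) != len(pattern):
        return None
    leaves = []
    for sub_pattern, sub_node in zip(pattern[1:], node[1:]):
        bound = match(sub_pattern, sub_node)
        if bound is None:
            return None
        leaves.extend(bound)
    return leaves

@lru_cache(maxsize=None)
def best_cover(node):
    """Return (minimum area, gate names) covering the tree rooted at node."""
    if node[0] == "in":
        return 0.0, ()
    best = None
    for name, area, pattern in LIBRARY:
        leaves = match(pattern, node)
        if leaves is None:
            continue
        total, gates = area, (name,)
        for leaf in leaves:            # optimal sub-covers are reused (memoized)
            leaf_area, leaf_gates = best_cover(leaf)
            total += leaf_area
            gates += leaf_gates
        if best is None or total < best[0]:
            best = (total, gates)
    return best

a, b = ("in", "a"), ("in", "b")
and_tree = ("inv", ("nand", a, b))
or_tree = ("nand", ("inv", a), ("inv", b))
```

With these assumed areas, `best_cover(and_tree)` selects the single AND2 gate over the INV plus NAND2 pair. A delay-oriented mapper replaces the area sum with an arrival-time computation at each match.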
A.2 Technology mapping of asynchronous circuits
Different asynchronous design styles make different assumptions about the delays within the circuit and how the circuit interacts with the environment. These characteristics have a dramatic impact on their technology mapping. This section briefly characterizes these differences. This paper focuses on burst-mode circuits, a class of asynchronous circuits that rely on the fundamental-mode assumption [35] , which states that the circuit is allowed to settle between input transitions that arrive at the circuit in bursts. Other asynchronous design styles, such as speed-independent circuits and timed circuits, however, operate in the input/output mode. That is, the environment may introduce new input changes concurrently with the circuit generating new outputs as long as the input behavior satisfies a given specification.
Siegel and De Micheli [30] studied the technology mapping problem for burst-mode circuits. They found that, with just small modifications, synchronous technology mapping solutions can be safely used. More specifically, Siegel and De Micheli used results from Unger [35] to demonstrate that the standard tree decomposition of the equations into base functions does not introduce hazards. This result was formalized by Kung who proved that the tree decomposition is a hazard-freedom-preserving logic transformation for fundamental-mode circuits [16] . Moreover, Siegel and De Micheli presented an algorithm that analyzes implementations of library gates to determine if they might in some cases cause a hazard when used in the covering phase. Their results demonstrated that most library gates can be used safely, but some MUXes and AOIs, for example, might cause hazards depending on how they are used. When such gates match to a particular node in the graph, further analysis is required to determine if the match causes a hazard in that instance [30] . The key shortcoming of their work is that the underlying synchronous technology mapper they used is limited to optimizing the worst-case performance, not the average-case performance. It is this limitation that our approach addresses.
We must emphasize that the hazard-freedom-preserving feature of decomposition in burst-mode circuits does not hold for input/output-mode circuits. Intuitively, this is because, in input/output-mode circuits, extra transitions on internal signals newly generated by the decomposition are not guaranteed to settle before the circuit inputs change and consequently may unexpectedly interact with changing inputs, thereby causing hazards. Consequently, technology mapping techniques for speed-independent and timed circuits have to pay extra attention to the hazard-freedom-preserving nature of their mapping techniques [1], [5], [14], [9], [21], [31].
We also note that the additional challenges associated with the decomposition of input/output-mode circuits may partially explain why these efforts have not yet addressed optimizing for average-case performance. Moreover, we believe that the ease of decomposition for fundamental-mode circuits makes burst-mode circuits a good initial target for average-case optimization research.
B. Burst-mode control circuits
This section reviews the specification, implementation, and operation of burst-mode circuits.
B.1 Extended burst-mode (XBM) specifications
An extended burst-mode (XBM) specification [36], [39] consists of a finite number of states, a set of labeled state transitions connecting pairs of states, and a start state. Fig. 1a shows the XBM specification of the SCSI-INIT-SEND circuit, which has a conditional input cntgt1, 3 edge inputs (ok, rin, fain), and 2 outputs (aout, frout) [36], [39]. If a state transition is labeled with a*, the following state transitions in the specification must be labeled with a* or with a+ or a- (a+ and a- terminate the sequence of don't cares, which is the reason these are called terminating edge signals). In addition, a directed don't care may change at most once during a sequence of state transitions it labels, i.e., it changes monotonically, and, if it does not change during this sequence, it must change in the state transition its terminating edge labels. A terminating edge which is not immediately preceded by a directed don't care is called compulsory, since it must appear in the state transition it labels [36], [39]. In Fig. 1a, ok+ is a compulsory edge because it must appear in the transition from 0 to 1. rin+ in the transition from state 5 to 3, on the other hand, is a terminating edge, but not a compulsory edge, because rin can rise at any point as the circuit transitions from state 4 through state 5 but it must have risen by the time the circuit enters state 3.
B.2 3D implementation style
In this paper, we assume that XBM specifications are implemented as 3D machines [39]. A 3D machine is formally repre- The hardware implementation of a 3D machine is a combinational network in which all state variables and some primary outputs are fed back as inputs to the network through delay elements. For example, the 3D implementation of the SCSI-INIT-SEND controller is shown in Fig. 1b. Note that the state variables, and G is a set of gates that label nodes. Edge $(m, n) \in E$ connects node m to node n. m is called a fanin of n and the set of fanins of n is denoted FI(n). Gate g associated with node n represents the logic function of that node and is referred to as g(n).
If $m \in FI(n)$, we say that m is a fanin of gate g(n).
For convenience, the set of nodes in $X \cup Y' \cup Z'$ is referred to as DAG inputs (DI's). Similarly, the set of nodes in $Y \cup Z$ is referred to as DAG outputs (DO's).
B.3 State transitions in 3D machines
The state transitions in 3D machines can be designed in two ways: two-phase or three-phase. In a two-phase state transition, the machine waits for an input burst and then generates a concurrent output and (possibly empty) state burst. In a three-phase state transition, the machine waits for an input burst and then generates a state burst followed by an output burst [36]. Alternatively, three-phase state transitions can generate the output burst before the state burst. To simplify the notation, we focus on the former three-phase design; extensions to handle the latter three-phase design are straightforward.
1 3D-gc machines which are implemented using generalized C-elements [40] are not considered in this paper and the technology mapping of these circuits is left as future work.
C. Worst-case timing analysis for synchronous circuits
To put our asynchronous timing analysis techniques into perspective, we first briefly discuss several worst-case delay analysis algorithms, including static timing analysis, static sensitization analysis, and dynamic sensitization analysis. To simplify the exposition, we restrict ourselves to circuits composed of simple gates, such as NAND, NOR, AND, and OR, etc.
For a synchronous circuit, the worst-case delay of a combinational block sets a lower bound on the clock cycle time of the circuit. Traditional static timing analysis estimates the worst-case delay by computing the topological delay of the longest path [11]. This estimation is pessimistic since the longest path may not be exercised by any input pattern, i.e., no event can propagate along the path. Such a path is said to be non-sensitizable and is often called a false path [8], [3], [20], [24].
Many researchers have worked on the path sensitization problem to eliminate false paths. Static sensitization analysis, which is based on the well-known D-algorithm [27] , ignores the timing of each input signal and consequently may underestimate the delay of the circuit [3] . Dynamic sensitization analysis, on the other hand, considers the timing of input signals and thus finds the true critical path but requires a computationally intensive task of analyzing all possible input patterns [8] . More efficient input-pattern-independent methods have been proposed to approximate dynamic sensitization analysis, providing an upper bound on the worst-case delay [20] , [24] .
To the best of our knowledge, these advanced techniques have not been incorporated into a technology mapping procedure. Instead, most technology mappers use more pessimistic, static timing analysis techniques. Consequently these technology mappers can result in non-optimal circuits because of the false path problem.
It is also important to note that all the above techniques adopt the floating-mode assumption; i.e., the initial values on circuit nodes are unknown at the time the input pattern is applied. This is because these techniques target combinational circuits in which the sequence of input patterns applied is assumed to be arbitrary.
III. AVERAGE PERFORMANCE OF BURST-MODE CIRCUITS
This section formalizes our technology mapping objective functions: average latency and cycle time. It first defines the delay through the combinational blocks and the required feedback delays and fundamental-mode constraints. It then formalizes our definitions for average latency and cycle time using Markovian analysis.
A. Determining the delay through a combinational logic block
This section describes and extends Kung's technique [15] to compute the delay through the combinational logic blocks of burst-mode circuits. Kung observed that in such circuits the sequence of possible input patterns is well-defined and the internal signals are guaranteed to stabilize to known values before the application of each new input pattern. Consequently, he adopted the single step transition mode model rather than the less accurate floating-mode model mentioned above [15].
For our application, we associate one input pattern with each burst that generates transitions at the DO's. Thus, for both two-phase and three-phase state transitions, we create a cld-pattern modeling the combinational logic delay of the input burst. Three-phase state transitions, however, have an additional cld-pattern modeling the combinational logic delay of the state variable burst. The values in the cld-pattern correspond to the values of all DI's immediately after the last terminating transition of the burst arrives at the circuit. For example, in Fig. 1a, the input pattern for the input burst $\langle$cntgt, rin, fain$\rangle$ of state transition $3 \to 4$ is $\langle 1, F, 0 \rangle$. fain is 0 because it falls during state transition $2 \to 3$.
cntgt is 1 because we assume that all conditional signals are stable at their sampled value before the compulsory transitions arrive. Note that all state variables introduced by the synthesis process should also be included in the pattern. Moreover, we assume that the user indicates when directed don't care signals change. This latter assumption can be easily generalized, assuming that the user specifies the probability of the directed don't care firing.
The first step of Kung's delay analysis is to perform an extended 8-valued logic simulation to obtain the logic value at the output of each circuit node for each input pattern [15] . In this logic, 1 and 0 denote stationary values, R and F denote transitional rising and falling values, S0 and S1 denote static 0 and static 1 hazardous values, and DR and DF denote dynamic rising and falling hazardous values.
To perform the 8-valued simulation, the DI's associated with an input pattern are first set to either a stationary or transitional value depending on whether the input pattern indicates that they have changed compared to their known initial value. These values are then propagated through the circuit using 8-valued truth tables of all gates until the DO's are reached. For example, Table I shows the 8-valued truth table of a node associated with a 2-input NAND gate.
[Table I: 8-valued truth table of a 2-input NAND gate.]
Since the circuit is hazard-free, the DO's must have either stationary or transitional values. Internal signals, however, may have hazardous values. Kung observed that because the DO's are hazard-free under the unbounded gate and wire delay model, the delay at any DO cannot be influenced by the delay of internal nodes with hazards. In addition, he realized that any node with a stationary value cannot affect the delay of any DO. Kung thus argued that the delay analysis for a given pattern i can be performed on a reduced circuit in which all nodes with hazardous values, stationary values, and any value which would be eventually suppressed by a stationary 0 value are removed. In practice, this is implemented by associating each input pattern i with the set of nodes N(i), referred to as sensitized nodes for pattern i, which have only transitional values that affect DO's. We thus compute the delay of a cld-pattern for only sensitized nodes, thereby saving both run-time and memory. Because we adopt a fixed-delay model, the delay of a node having a transitional value for a given pattern is fixed and is called the pattern arrival time of the node.
To define the pattern arrival time, we first review the notions of controlling and dominance [15]. A sensitized fanin of a simple gate (whose output is also sensitized) is controlling if it has a logic value that can independently determine the timing of the output transitional value. For example, a sensitized fanin of a NAND (NOR) gate with a value F (R) is controlling. A sensitized fanin is non-controlling otherwise. Node $m \in FI(n)$ dominates n for pattern i if m's pattern arrival time plus the pin-to-pin delay of g(n) from m to n is minimum (maximum) among all controlling (non-controlling) sensitized fanins.
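As an illustration of the 8-valued simulation, the following sketch evaluates a 2-input NAND gate. It is not a transcription of Table I: it is derived under the assumption that each value can be encoded as an initial level, a final level, and a may-glitch flag, and that under unbounded delays an output may glitch whenever an input may glitch or two clean inputs change in opposite directions, unless a stationary 0 masks the gate. The encoding and the function name `nand8` are ours.

```python
# Each value is (initial level, final level, may_glitch).
VALUES = {
    "0": (0, 0, False), "1": (1, 1, False),   # stationary
    "R": (0, 1, False), "F": (1, 0, False),   # clean transitions
    "S0": (0, 0, True), "S1": (1, 1, True),   # static hazards
    "DR": (0, 1, True), "DF": (1, 0, True),   # dynamic hazards
}
NAMES = {encoding: name for name, encoding in VALUES.items()}

def nand8(a, b):
    """8-valued output of a 2-input NAND (assumed conservative model)."""
    if a == "0" or b == "0":
        return "1"                      # a stationary 0 masks the other input
    ai, af, a_glitch = VALUES[a]
    bi, bf, b_glitch = VALUES[b]
    init = 1 - (ai & bi)
    final = 1 - (af & bf)
    # Two clean transitions in opposite directions may overlap and glitch.
    opposite = (not a_glitch and not b_glitch
                and ai != af and bi != bf and af != bf)
    return NAMES[(init, final, a_glitch or b_glitch or opposite)]
```

For instance, under this model two rising inputs give a clean falling output, whereas one rising and one falling input give a static-1 hazard.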
Using these definitions, we define the $i$-th pattern arrival time of n, denoted $\mathrm{pat}(n, g(n), i)$, recursively on the structure of the circuit as follows:
$$\mathrm{pat}(n, g(n), i) = \mathrm{pat}(m, g(m), i) + \text{p2p-delay}(n, m, g(n), v(i)), \qquad (1)$$
where m dominates n for cld-pattern i and $\text{p2p-delay}(n, m, g(n), v(i))$ is the pin-to-pin delay of gate g(n) from m to n for v(i), the transitional value at n for i. For example, consider a 2-input NAND node n, with sensitized fanins a and b. Assume that the pin-to-pin delays from a to n and from b to n are 1 and 2 respectively, and the pattern arrival times at a and b for pattern i are 3 and 4 respectively. If the logic values at a and b are F (transitional falling) for this pattern, then a is the dominant fanin of n because both fanins are controlling and $\min(1 + 3, 2 + 4)$ is 4.
Note that the pattern arrival time for DI's can be set by the user. When processing cld-patterns, however, their default value is 0. Once set, repeated application of the above definition yields the pattern arrival times at the DO's, i.e., the delay through the combinational logic.
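The recursive definition above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based circuit encoding and all names are hypothetical, only NAND and NOR controlling values (taken from the text) are modeled, and every fanin carrying R or F is treated as sensitized.

```python
# Controlling transitional value per gate type, as stated for NAND and NOR.
CONTROLLING = {"nand": "F", "nor": "R"}

def pat(node, circuit, values, p2p, di_arrival=None, memo=None):
    """Pattern arrival time of a sensitized node for one cld-pattern.

    circuit: node -> (gate type, list of fanins); nodes absent are DI's.
    values:  node -> 8-valued logic value for this pattern.
    p2p:     (fanin, node) -> pin-to-pin delay of the gate at node.
    """
    di_arrival = di_arrival or {}
    memo = {} if memo is None else memo
    if node in memo:
        return memo[node]
    if node not in circuit:                      # DAG input: default arrival 0
        return di_arrival.get(node, 0.0)
    gate, fanins = circuit[node]
    sensitized = [m for m in fanins if values.get(m) in ("R", "F")]
    times = {m: pat(m, circuit, values, p2p, di_arrival, memo) + p2p[(m, node)]
             for m in sensitized}
    controlling = [m for m in sensitized if values[m] == CONTROLLING.get(gate)]
    if controlling:                              # earliest controlling fanin dominates
        result = min(times[m] for m in controlling)
    else:                                        # latest non-controlling fanin dominates
        result = max(times[m] for m in sensitized)
    memo[node] = result
    return result

# The NAND example from the text: arrivals 3 and 4, pin-to-pin delays 1 and 2.
circuit = {"n": ("nand", ["a", "b"])}
p2p = {("a", "n"): 1.0, ("b", "n"): 2.0}
arrivals = {"a": 3.0, "b": 4.0}
falling = pat("n", circuit, {"a": "F", "b": "F", "n": "R"}, p2p, arrivals)
rising = pat("n", circuit, {"a": "R", "b": "R", "n": "F"}, p2p, arrivals)
```

With both fanins falling, the result is 4, matching the example; with both rising (non-controlling), the later fanin dominates and the result is 6.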
B. Determining the minimum required feedback delay and settling times
In addition to the delay through the combinational logic, the performance of a burst-mode circuit is determined by the minimum required feedback delay and the settling times. This subsection describes how these values are determined.
Consider a circuit in which the combinational logic has been mapped but the feedback delay has not been added. To compute the minimum feedback delay to add, our algorithm first determines which gates may exhibit an essential hazard, referred to as essential-hazard problem gates [6]. For each essential-hazard problem gate, our algorithm determines the minimum amount of feedback delay that guarantees that the essential hazard does not occur. It then conservatively takes the maximum of these delays as the feedback delay. Once the feedback delay is determined, the algorithm can compute the settling time required for a given pair of state transitions. It does this by first determining which gates may exhibit hazards due to a violation of the fundamental-mode assumption, referred to as fundamental-mode problem gates. For each such gate, it determines the minimum amount of delay the environment must wait before injecting the new input burst. Our algorithm then conservatively takes the maximum of these delays as the required fundamental-mode constraint for this state transition.
To identify the essential-hazard and fundamental-mode problem gates, our algorithm uses the techniques described in [6]. The essential-hazard problem gates are found by performing an extended-logic simulation of the circuit with a new pattern, called an fbd-pattern, in which the feedback state variables change simultaneously with the input burst. Gates which have hazards for this pattern but do not exhibit hazards during the input and state patterns are the essential-hazard problem gates [6]. Similarly, the fundamental-mode problem gates are found by performing an extended-logic simulation with a new fmc-pattern, in which the last burst of the current state transition changes simultaneously with the input burst of the subsequent state transition. Gates which have hazards for this pattern but do not exhibit hazards during the last burst and next input burst separately are the fundamental-mode problem gates [6].
Our approaches to compute the minimum feedback delay and settling time for a given problem gate are also similar. Consider computing the minimum feedback delay required for a given problem gate g. Our algorithm first derives the propagation delays to the fanins of gate g excluding the feedback delays (which at this stage of the design process are unknown). Consider the case where the essential hazard at g occurs because a fanin f1, changing in response to the changing feedback state variable, changes sooner than a second fanin f2, changing directly in response to the input burst. The minimum feedback delay necessary to avoid an essential hazard at this gate is obtained by subtracting the propagation delay to f1 plus the corresponding pin-to-pin delay from the propagation delay to f2 plus the corresponding pin-to-pin delay.
When calculating the propagation delay to the fanins of problem gates (such as f1 and f2 described above), we need to consider the delay of nodes that exhibit hazards. Consequently, the delay calculation used for cld-patterns, which ignores hazards, is not applicable. The basic issue to be resolved is that hazardous nodes may transition multiple times in response to an fbd- or fmc-pattern and the delays of these transitions must be considered. To do this, our algorithm uses two different approaches.
The first approach is to use standard event-driven simulation in which we associate an arrival time with each transition edge of a hazard caused by an fbd- or fmc-pattern. At each fanin of a gate, each transition edge representing an input event may cause an output event whose associated arrival time is computed as the arrival time associated with the input event plus the corresponding pin-to-pin delay of the associated gate. For the feedback delay computation described above, the propagation delay of f2 (f1) is set to be the arrival time associated with the latest (earliest) event of f2 (f1). This ensures that the minimum feedback delay is not underestimated.
One disadvantage of this approach is that an exponential number of arrival times is possible for a single node. Consequently, although computationally feasible in our decomposition routine, as is made clearer in Section V, it becomes too costly to use during our covering routine. Thus, during covering, for each node, our algorithm computes a single arrival time that is an upper bound of the time needed for the node to settle. Specifically, our algorithm computes the latest possible arrival time when the last event (edge) of a hazard can occur. For example, consider a 2-input AND gate driving node n, with fanins a and b, that exhibits a static 0-hazard due to a rising and b falling. Because the last edge of the static 0-hazard can occur only after b falls, the latest arrival time of node n is the arrival time of b falling plus the corresponding pin-to-pin delay of the gate. To simplify notation, we refer to this computed arrival time as the pattern arrival time for fbd- and fmc-patterns. Note that more advanced techniques in which the delay of the hazard is captured using both an upper and lower bound are also possible (e.g., see [6]). However, we have found that using a single upper bound is adequate and has the benefit of reducing time and space requirements significantly.
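The feedback delay computation described above can be sketched as follows, assuming the event-driven simulation has already produced the event arrival times at the two fanins of each essential-hazard problem gate. The record layout and field names are hypothetical; the sketch only shows the latest-minus-earliest difference per gate and the conservative maximum over gates.

```python
def min_feedback_delay(problem_gates):
    """Smallest feedback delay that avoids every listed essential hazard.

    Each entry describes one problem gate (all field names are assumptions):
      f1_events: arrival times of events at f1 (path through the feedback),
                 computed without any feedback delay
      f2_events: arrival times of events at f2 (path directly from the inputs)
      f1_p2p, f2_p2p: pin-to-pin delays of the gate from f1 and from f2
    """
    required = 0.0                      # no delay needed if no gate requires one
    for gate in problem_gates:
        earliest_via_feedback = min(gate["f1_events"]) + gate["f1_p2p"]
        latest_direct = max(gate["f2_events"]) + gate["f2_p2p"]
        # The feedback path must not win the race against the direct path.
        required = max(required, latest_direct - earliest_via_feedback)
    return required

gates = [
    {"f1_events": [1.0, 2.5], "f1_p2p": 0.5, "f2_events": [3.0, 4.0], "f2_p2p": 1.0},
    {"f1_events": [6.0], "f1_p2p": 1.0, "f2_events": [2.0], "f2_p2p": 1.0},
]
delay = min_feedback_delay(gates)
```

In this example the first gate requires a delay of 3.5 and the second requires none, so the conservative maximum is 3.5. The same structure would apply to the fundamental-mode constraint, with fmc-pattern events in place of fbd-pattern events.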
C. Formalizing our average performance objective functions
This subsection describes how to obtain the probability for each state transition. Using these probabilities, it then formalizes the two objective functions we optimize for: average cycle time and average latency.
C.1 The probabilities of state transitions
First, we model the execution of the circuit with a stochastic process $\{B_n;\ n = 0, 1, 2, \ldots\}$. $B_n$ (the $n$th state of the process) equals state transition $t$ if the $n$th state transition executed by the process is state transition $t$. Consider the circuit depicted in Fig. 1. If, from the initial state (state 0) of the circuit, the machine executes state transition $0 \to 1$ followed by state transitions $1 \to 2$ and $2 \to 3$, then $B_1 = 0 \to 1$, $B_2 = 1 \to 2$, and $B_3 = 2 \to 3$.
Second, we let a trace $T$ of the circuit be a sequence of specified state transitions, starting with the initial state. We denote by $\mathcal{T}_k$ the set of all possible traces of length $k$. With each such trace $T \in \mathcal{T}_k$ we associate a probability which reflects how common this trace is with respect to other traces of the same length. These probabilities are subject to the constraint that, for a given length $k$, the probabilities of the traces of length $k$ add up to 1, i.e., $\sum_{T \in \mathcal{T}_k} \Pr_{XBM}(T) = 1$. The long-term proportion of a transition $t$, denoted $\pi_t$, is the long-term proportion of states in which the stochastic process is in transition $t$:
Markov chain theory is then used to obtain the long-term proportion of state transitions. A Markov chain is a stochastic process $X_n$ in which the conditional distribution of $X_{n+1}$ is independent of past states and depends only on the value of $X_n$. If we assume there is no correlation between subsequent environmental choices, then this condition holds for $B_n$. A Markov chain is irreducible if every state can be reached from every other state. If the XBM specification is strongly connected, this condition holds.
Fortunately, in practice, these assumptions are usually satisfied or represent a reasonable approximation of reality. In addition, for XBM specifications that are not strongly-connected, simple extensions to this analysis apply.
For an irreducible Markov chain it can be shown that the $\pi_t$'s, one for each transition $t$, are the unique non-negative solutions to the following set of equations [25]:
where $\Pr_{XBM}(t)$ is the conditional probability of state transition $t$.
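As a rough illustration of how such long-term proportions can be computed, the following Python sketch treats each state transition as a state of the Markov chain and iterates to the stationary distribution. It is only a sketch under stated assumptions: the balance condition in the comments is a standard form for such chains rather than a quotation of the equations in [25], each transition is encoded as a hypothetical (source, destination) pair, and the three-state specification is invented for the example and is not the circuit of Fig. 1.

```python
def transition_proportions(cond_prob, tol=1e-12, max_iter=100000):
    """Estimate the long-term proportion of each state transition.

    cond_prob maps a transition (src, dst) to the conditional probability of
    taking it once the machine is in state src. Assumed balance condition:
        pi[t] = cond_prob[t] * sum(pi[u] for every u that ends in src(t)),
    with all pi[t] summing to 1. Assumes a strongly connected specification
    and uncorrelated environmental choices.
    """
    transitions = list(cond_prob)
    pi = {t: 1.0 / len(transitions) for t in transitions}
    for _ in range(max_iter):
        # Mass arriving in each state = mass of the transitions that end there.
        inflow = {}
        for (src, dst), mass in pi.items():
            inflow[dst] = inflow.get(dst, 0.0) + mass
        # "Lazy" update: averaging with the old value avoids oscillation on
        # periodic specifications without changing the fixed point.
        new_pi = {
            t: 0.5 * pi[t] + 0.5 * cond_prob[t] * inflow.get(t[0], 0.0)
            for t in transitions
        }
        delta = max(abs(new_pi[t] - pi[t]) for t in transitions)
        pi = new_pi
        if delta < tol:
            break
    return pi


# Hypothetical specification: in state 1 the environment chooses between
# going to state 2 (probability 0.7) and returning to state 0 (probability 0.3).
spec = {(0, 1): 1.0, (1, 2): 0.7, (1, 0): 0.3, (2, 0): 1.0}
pi = transition_proportions(spec)
```

For this toy specification the proportions settle at 10/27, 7/27, 3/27, and 7/27 for the four transitions in the order listed, which can be checked by substituting them into the balance condition.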
C.2 Definition of average latency
As mentioned earlier, latency is the critical performance metric of burst-mode circuits in which the environment is slow. Informally speaking, latency is the time from the arrival of the input burst until the output burst is generated. If it is not hidden by the concurrent operation of other datapath components (e.g., multiplexing) [38], it is the performance overhead associated with sequencing datapath computations.
Each burst-mode state transition has a latency. The latency of a two-phase state transition is simply the maximum pattern arrival time over all outputs for the transition's input burst, i.e.,
$$\mathrm{lat}(t) = \max_{o \in Y} \mathrm{pat}(o, g(o), \text{i-burst}(t)).$$
The latency of a three-phase state transition, on the other hand, is the sum of the maximum pattern arrival time over all state variables for the transition's input burst, the delay through the feedback buffers, and the maximum pattern arrival time over all output variables for the transition's state variable burst. Note that for the state variable burst, the pattern arrival time of the fed-back inputs is initialized to 0.
The average latency is defined as the sum of the latencies weighted by the relative probability of the state transitions.
C.3 Definition of average cycle time
As mentioned earlier, the cycle time of the circuit becomes a critical performance metric when the environment is very fast. Cycle time, defined over a pair of sequential state transitions $(t, t')$, is the minimum time between the end of the input burst of transition $t$ and the beginning of the input burst of the next state transition $t'$. It equals the latency of the state transition $t$ plus the required fundamental-mode constraint needed in preparation for $t'$'s input burst. This fundamental-mode constraint includes the feedback delay plus the settling time required for the fed-back signals to propagate into the combinational logic deeply enough to ensure that no unexpected hazards occur when $t'$'s input burst arrives. Thus,
$$\mathrm{cyc}(t, t') = \mathrm{lat}(t) + \text{fb-delay} + \text{settling-time}(t, t'). \qquad (7)$$
The average cycle time is defined as the weighted sum of the cycle times of all state transition pairs. The weight of each state transition pair $(t, t')$ equals the long-term probability of being in state transition $t$, $\pi_t$, times the conditional probability that the next transition is $t'$ given that the current state transition is $t$.
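The two objective functions can be illustrated with a short Python sketch. It assumes that the long-term proportions, the per-transition latencies, a single feedback delay, and the per-pair settling times are already available; the dictionary layout, the (source, destination) encoding of transitions, and all numbers are hypothetical and chosen only to make the weighted sums concrete.

```python
def average_latency(pi, lat):
    """Sum of transition latencies weighted by long-term proportion."""
    return sum(pi[t] * lat[t] for t in pi)


def average_cycle_time(pi, cond_prob, lat, fb_delay, settling):
    """Weighted sum of cyc(t, t') over all pairs of consecutive transitions.

    The weight of a pair is pi[t] times the conditional probability of t'.
    """
    total = 0.0
    for t in pi:
        for t_next in pi:
            if t_next[0] != t[1]:
                continue  # t_next does not start where t ends
            cyc = lat[t] + fb_delay + settling[(t, t_next)]
            total += pi[t] * cond_prob[t_next] * cyc
    return total


# Hypothetical two-state machine that alternates between states 0 and 1.
pi = {(0, 1): 0.5, (1, 0): 0.5}
cond_prob = {(0, 1): 1.0, (1, 0): 1.0}
lat = {(0, 1): 4.0, (1, 0): 2.0}
settling = {((0, 1), (1, 0)): 0.5, ((1, 0), (0, 1)): 1.5}

avg_lat = average_latency(pi, lat)
avg_cyc = average_cycle_time(pi, cond_prob, lat, fb_delay=1.0, settling=settling)
```

With these numbers the average latency is 3.0 and the average cycle time is 5.0, since the two transition pairs have cycle times 5.5 and 4.5 and equal weight.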
IV. PHASE 1: DECOMPOSITION FOR AVERAGE LATENCY AND CYCLE TIME
The goal of decomposition is to transform a circuit into an equivalent network that consists of only a set of base functions. These base functions form a "fine-grain" representation which provides the subsequent covering step with more flexibility in how to cover the network using library gates. By applying a standard tree-decomposition technique [28] to a given circuit, we create a NAND-decomposed network whose base functions are INVERTERs and 2-input NANDs. The results of Unger [35] and Kung [16] ensure that this process preserves hazard-freedom.
An important observation is that there may be many possible NAND-decomposed networks obtainable through tree decomposition, but they are all hazard-free. The choice of which NAND-decomposed network to use, however, can have a significant impact on the performance of the final mapped circuit. Thus, in order to obtain a better final mapped circuit, we propose an efficient heuristic to find a NAND-decomposed network with optimal performance. Our heuristic is motivated by Rudell's well-known optimization for worst-case delay [28]. Rudell repeatedly identified each 3-input NAND subgraph in a decomposed network and rearranged its fanin cones to optimize the worst-case latency of the network. We call this rearrangement of fanin cones "NAND3 rotation." Each NAND3 rotation creates one of three isomorphic NAND3 subgraphs, as illustrated in Fig. 3. Consider the isomorphic NAND3 labeled $r_0$. By reconnecting fanin a to input c, fanin c to input b, and fanin b to input a, we complete a clockwise "rotation" of fanins and produce the isomorphic NAND3 $r_1$. Rudell repeatedly rotates each NAND3 subgraph in a decomposed network such that the latest arriving input of the NAND3 is pushed closer to the output. Similarly, our heuristic repeatedly rotates each NAND3 subgraph in a decomposed network and chooses the rotation which minimizes the user-specified cost function. Since NAND3 rotation is equivalent to simply choosing a different tree decomposition of the network, it is hazard-preserving. Currently, the user can specify as a cost function either average latency or average cycle time; however, extensions to any combination of the two are trivial.
Although this heuristic is somewhat simple, our experimental results (see Section VI) demonstrate that it is both computationally manageable and very effective. Pseudocode of our implementation is given in Algorithm 1. Note that it calls Algorithm 2 to do rotation on each NAND3 of a network and uses a while loop to repeatedly rotate all NAND3s in the network until no further improvement is possible. More specifically, the objective function we measure is based on two parameters: the average latency or cycle time of the network (depending on user specification), and the average pattern arrival time of the node n where NAND3 m is rooted. The average latency or cycle time of the network is the principal objective function, and the average pattern arrival time of the node is used to break ties in the overall average latency/cycle time of the network. This tie-breaking feature often allows us to move out of local minima. Note also that to recompute the average latency after each NAND3 rotation, the extended logic and event-driven simulation must be performed again. Moreover, for the average cycle time, the fundamental-mode constraints must also be recalculated. To save run-time, our algorithm re-analyzes only those nodes affected by the rotation, i.e., the nodes in the rotated NAND3 and in their transitive fanout. Since the network is unmapped, we assume that the circuit is trivially mapped to NAND2s and inverters and use estimated delays and load capacitances for these gates when simulating.
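The difference between a worst-case and an average-case rotation choice can be sketched for a single NAND3 in Python. This is a deliberately simplified illustration and not Algorithm 1 or Algorithm 2: it assumes unit gate delays, models the NAND3 as NAND2(INV(NAND2(x, y)), z), uses the maximum fanin arrival time as the settling time of each gate, and omits the tie-breaking rule. The function names and the two weighted patterns are hypothetical.

```python
def rotation_cost(order, patterns, d_nand=1.0, d_inv=1.0):
    """Probability-weighted arrival time at the NAND3 output.

    order = (x, y, z): x and y feed the inner NAND2, z feeds the output NAND2
    directly and is therefore the fanin closest to the output.
    patterns = list of (probability, {fanin: arrival time}).
    """
    x, y, z = order
    cost = 0.0
    for weight, arrival in patterns:
        inner = max(arrival[x], arrival[y]) + d_nand + d_inv
        cost += weight * (max(inner, arrival[z]) + d_nand)
    return cost


def best_rotation(patterns):
    """Pick the rotation with the lowest weighted cost among the three."""
    rotations = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]
    return min(rotations, key=lambda order: rotation_cost(order, patterns))


# Fanin c is late in the frequent pattern; fanin a is very late in a rare one.
patterns = [
    (0.9, {"a": 0.0, "b": 0.0, "c": 5.0}),
    (0.1, {"a": 8.0, "b": 0.0, "c": 0.0}),
]
best = best_rotation(patterns)
```

In this example a worst-case criterion would push fanin a (latest arrival, 8.0) closest to the output, whereas the weighted cost prefers to keep fanin c closest to the output because c is the late input in the pattern that occurs 90% of the time.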
V. PHASE 2: COVERING FOR AVERAGE PERFORMANCE
Our covering procedure is divided into two stages. First, a postorder traversal (from DI's to DO's) uses dynamic programming to determine a set of covering solutions rooted at each node. Then a preorder traversal (from DO's to DI's) selects gates that maximize the user-specified performance metric (latency or cycle time) subject to a given area constraint.
A. Postorder traversal for bottom-up matching
A.1 Algorithm overview
We first introduce our terminology, some of which is borrowed from [7]. The NAND-decomposed graph is referred to as a subject graph and the set of available library gates is denoted by $L$. A match $h$ for a node $n$ in the subject graph is a pattern graph of $L$ that is isomorphic to a subgraph rooted at node $n$ and satisfies the conditions found by Siegel and De Micheli to ensure hazard-freedom [30]. Each match $h$ is associated with a set of gates which have the same pattern graph. The set of nodes in a match $h$ for a node $n$ is referred to as merged$(n, h)$. The set of fanin nodes of merged$(n, h)$ is denoted by inputs$(n, h)$.
A point $p$ is a tuple $\langle h.g, pat, fbd, st, early, late, area \rangle$ that contains the key parameters of a cover of a subgraph rooted at a node $n$. First, the point identifies the matching gate $h.g$ which is used to derive the point. In addition, the point includes the function $pat : P_i \to \mathbb{R}$ which returns the pattern arrival time, $pat[i]$, for a set of input patterns $P_i$ for which $i$ is sensitized, which, as mentioned earlier, can include the following. For a two-phase state transition, there is one cld-pattern modeling the combinational logic delay of the input burst, one fbd-pattern for determining the feedback delay, and one fmc-pattern for each possible next state transition for determining the corresponding fundamental-mode constraints. For a three-phase state transition, there is an additional cld-pattern associated with the state variable burst and an additional fbd-pattern to determine the feedback delay needed to avoid essential hazards caused by the state variable burst. From these pattern arrival times, our algorithm estimates the minimum feedback delay $fbd$ needed to avoid essential hazards for this mapping of the subgraph rooted at node $n$. In addition, our algorithm computes the settling times $st$ (a function into $\mathbb{R}$) that are required for pairs of subsequent state transitions to ensure that no fundamental-mode constraint is violated. $early$ and $late$ are bookkeeping vectors used to store sets of fanins associated with the early and late transitions involved in any potential hazard at that node. Finally, $area$ is the area at node $n$. Since many pattern arrival times are stored in a point, a point is actually multi-dimensional and the set of points forms a multi-dimensional "surface." This is an extension of the two-dimensional curve in [7].
Algorithm 3 shows the pseudocode for building an area-performance tradeoff surface at a node $n$. First, the set of matching gates $h.g$ at $n$ is found using the modified matching algorithm developed by Siegel and De Micheli [30]. For each matching gate, the algorithm creates many possible area-performance points depending on how the cones of logic rooted at the inputs of the match $h$, inputs$(n, h)$, are mapped. Because the nodes are visited in postorder, the possibly optimal ways to implement the cones of logic rooted at inputs$(n, h)$ have already been analyzed and stored in different points on surfaces at inputs$(n, h)$. For each combination of surface points at inputs$(n, h)$, an area-performance point is computed as follows. To compute the average cycle time for a point $p$, the first step is to compute the pattern arrival time for every cld-pattern, fmc-pattern, and fbd-pattern. For each fbd-pattern $i \in P_{fbd}$, the algorithm then selects the maximum feedback delay among the feedback delays stored in in_pts as the default feedback delay for $p$ for resolving essential hazards which may occur in the fanin cones of $h$. If $g$ is an essential-hazard problem gate rooted at $n$ for pattern $i$, the algorithm also estimates a new feedback delay to avoid an essential hazard occurring at gate $g$ when $i$ is applied (as is explained in Section V-A.2). If the default feedback delay is smaller than the newly computed feedback delay, the feedback delay for $p$ is updated using the new feedback delay. For each fmc-pattern $i \in P_{fmc}$, the fmc for $p$ is estimated similarly. But instead of only one fbd being stored in $p$, an array of fmc's for all fmc-patterns is stored. The area of the point is the area of $h.g$ plus the total area of in_pts. Then, the point is inserted into the surface and the function remove_inferior_pts() removes all surface points which are deemed unlikely to lead to an optimal covering (as is described in Section V-A.5). Consider the example depicted in Fig. 7.
The area is the sum of the areas of a and e plus the area of the AND gate, which is assumed to be 1, yielding 7. As is explained in Section V-A.5, only four of the six possible points generated are deemed likely to lead to a good covering; the other two are removed by the function remove_inferior_pts().
A.2 Estimating the required feedback delay and settling-times
Recall that the required feedback delay and settling time of a particular problem gate are obtained from the difference between the propagation delay through the feedback signals and the propagation delay from the primary inputs to the problem gate [6]. Once the combinational logic portion of the circuit is mapped, this value can be readily obtained (as described in Section III-B). During the covering process, however, the delay through the feedback signals is not known since the nodes are processed in postorder. In particular, the logic between the problem gate and the corresponding state variables/primary outputs has not yet been chosen.
To address this problem, our algorithm estimates the delay of each state variable/primary output z to be the pattern arrival time of z in the un-mapped network. Although this estimate is somewhat coarse, our experimental results suggest it is effective.
A.3 Extending the computation of pattern arrival times to complex gates
To calculate the pattern arrival time during covering, function comp_pat_arv_time() must analyze matching gates that may be complex. For such gates, we algorithmically extend our definition of a dominant fanin. Specifically, we recursively traverse the NAND-decomposed subgraph representing the complex gate to simultaneously find the dominant fanin and pattern arrival times for a given point $p$. For each complex-gate input $m$, we calculate $p.pat[i] + \text{p2p-delay}(n, m, h.g, t)$, where $t$ is the transition that node $n$ makes in pattern $i$. We then recursively propagate these delays up through the subgraph, selecting the delay and fanin node which is dominant for the current subgraph node.
For example, consider a matching AOI gate $h.g$ rooted at node $n$ where all fanins are non-controlling. The algorithm first chooses the maximum sum in each AND subgraph and then chooses the minimum sum for the OR subgraph among those maximum sums of AND subgraphs. The corresponding $m$ for the minimum of the maximums is the dominant fanin $m_d$ of $n$, and the minimum of maximums is the pattern arrival time $p.pat[i]$.
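The AOI case described above amounts to a minimum over AND groups of the maximum within each group. A minimal Python sketch follows; the function name, the data layout, and the numeric sums are hypothetical, and each sum stands for an input's pattern arrival time plus its pin-to-pin delay.

```python
def aoi_pattern_arrival(and_groups):
    """Return (dominant fanin, pattern arrival time) for an AOI gate.

    and_groups is a list of AND subgraphs; each is a list of
    (fanin name, arrival time + pin-to-pin delay) pairs. Assumes all fanins
    are non-controlling, as in the example in the text.
    """
    # Latest fanin within each AND subgraph.
    latest_per_group = [max(group, key=lambda f: f[1]) for group in and_groups]
    # Earliest of those maxima across the OR subgraph.
    return min(latest_per_group, key=lambda f: f[1])


# Hypothetical AOI22 gate: (a AND b) OR (c AND d), inverted.
groups = [[("a", 3.0), ("b", 5.0)], [("c", 4.0), ("d", 2.0)]]
dominant_fanin, arrival = aoi_pattern_arrival(groups)
```

Here the group maxima are 5.0 (fanin b) and 4.0 (fanin c), so fanin c is dominant and the pattern arrival time is 4.0.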
A.4 Accounting for unknown loads
The function comp_pat_arv_time() uses the pin-dependent delay model in MIS [7] to model the delay of $h.g$ from input pin $m$ to output pin $n$, as follows:
$$\text{p2p-delay}(n, m, h.g, t) = \tau(n, m, h.g, t) + R(n, m, h.g, t)\, C_n, \qquad (8)$$
where $\tau(n, m, h.g, t)$ is an intrinsic pin-to-pin delay of the gate, $R(n, m, h.g, t)$ is a pin-specific output drive-resistance, and $C_n$ is the capacitive load at the output of $h.g$.
It is important to note that when the postorder procedure first visits each $m \in$ inputs$(n, h)$, the exact output load $C_m$ is unknown because some of its fanouts may not have been mapped. In particular, this node may have fanouts in cones of logic that have been traversed as well as cones of logic that have yet to be traversed. For this reason, our algorithm must estimate its load. For the mapped fanouts it accounts for the input capacitance of the mapped gate. For the unmapped fanouts it uses a default input capacitance $C_d$. This is known as the unknown load approximation.
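The unknown load approximation combined with the pin-dependent delay model can be sketched in a few lines of Python. The function names and all numeric values are hypothetical; a fanout whose gate has not yet been chosen is represented by None and contributes the default input capacitance.

```python
def estimate_load(fanout_caps, default_cap):
    """Load at a node: known input capacitances for mapped fanouts,
    the default capacitance for fanouts that are not yet mapped (None)."""
    return sum(default_cap if cap is None else cap for cap in fanout_caps)


def p2p_delay(intrinsic, drive_resistance, load):
    """Pin-to-pin delay: intrinsic delay plus drive resistance times load."""
    return intrinsic + drive_resistance * load


# One mapped fanout (0.2) and two unmapped fanouts using the default 0.25.
load = estimate_load([0.2, None, None], default_cap=0.25)
delay = p2p_delay(intrinsic=1.0, drive_resistance=2.0, load=load)
```

With these values the estimated load is 0.7 and the resulting delay is 2.4; once a fanout is mapped, its None entry would be replaced by the input capacitance of the chosen gate.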
A useful observation made in [7] is that when the main procedure visits node n and computes a point for h:g, the exact output load for each node m is more precisely known because the input capacitance of its fanout n is determined by the gate h:g. More specifically, the previously unknown load contribution of h:g is now known.
To obtain a better estimate of the pattern arrival time at n, the 
A.5 Removing points with inferior average performance
The criterion for determining the inferiority of a point affects the efficiency and effectiveness of the algorithm. A good criterion removes a large number of points but still keeps the final solution as close to optimal as possible.
We first propose an exact technique to remove inferior points that depends on what the tool is optimizing. When optimizing for latency, a point is inferior if there exists another point on the surface which has less than or equal latency for all state transitions. When optimizing for cycle time, the other point must have less than or equal cycle times for all state transition pairs. These definitions yield the optimal covers subject to load shift errors. However, under this definition, it is difficult for a point to be inferior since any individual latency/cycle time may violate the above condition. Our experimental results in Section VI show that the number of non-inferior points becomes unmanageable for large circuits.
Thus, in order to reduce the number of points, we propose a new heuristic to quantify the potential merits of one point over another based on the notions of a point's average latency and cycle time. Specifically, the point average latency is an estimate of the average latency of the circuit up to that node for the cover of the subgraph represented by the point. It is defined as the sum of the point's state transition latencies weighted by the probability of the state transition. For a two-phase state transition the point's state transition latency is simply the point's pattern arrival time for the cld-pattern associated with this state transition. For a three-phase state transition, the point's state transition latency also consists of the pattern arrival time for the cld-pattern associated with the state burst and the current estimated value of the required feedback delay.
Similarly, the point average cycle time is an estimate of the average cycle time of the circuit up to that node for the cover of the subgraph represented by the point. It is defined as the point average latency plus the current estimate of the average required fundamental-mode constraint.
A point has inferior average performance if there exists any other point on the surface which has less than or equal area and less than or equal average latency or cycle time, depending on which the user directs the mapper to optimize. Our experimental results indicate that judging inferiority based on point average latencies and cycle times is very effective. In other words, these quantities adequately reflect the relative overall performance impact of various mapping options for a given subgraph.
In our implementation, we maintain the area-performance surface as an area-ascending ordered list to make removing points with inferior average performance efficient. Specifically, a point $p$ divides the list into two sets A and B. Each point in set A has no larger area than $p$ and each point in set B has larger area than $p$. If there exists any point in set A with better (or equal) average performance than $p$, $p$ has inferior average performance and is removed from the list. If not, $p$ is not removed and any point $p'$ in set B that has worse (or equal) average performance than $p$ is removed. Thus, by construction, the list remains not only in area-ascending order but also in average-performance-descending order. Consequently, removing points with inferior average performance takes linear time with respect to the number of points in the list.
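The list maintenance described above can be sketched in Python as a filter over (area, average delay) pairs, where a smaller delay means better average performance. This is an illustrative sketch rather than the implementation of remove_inferior_pts(): the function name and the six sample points are hypothetical (they are not the values of Fig. 7), and, as a small simplification, an existing point with the same area as the new point but worse delay is also removed.

```python
def insert_point(surface, point):
    """Insert (area, avg_delay) into an area-ascending list of points,
    dropping every point that has inferior average performance."""
    area, delay = point
    # Set A: points with no larger area. If one of them is at least as fast,
    # the new point is inferior and the surface is unchanged.
    for a, d in surface:
        if a <= area and d <= delay:
            return surface
    # Otherwise keep the new point and drop existing points that are no
    # smaller in area and no faster than it.
    kept = [(a, d) for a, d in surface if not (a >= area and d >= delay)]
    kept.append(point)
    kept.sort()  # restore area-ascending order
    return kept


surface = []
for candidate in [(7, 3.0), (5, 4.5), (6, 4.5), (4, 4.5), (9, 2.5), (3, 6.0)]:
    surface = insert_point(surface, candidate)
```

Of the six hypothetical candidates, four survive: (3, 6.0), (4, 4.5), (7, 3.0), and (9, 2.5). The two points with delay 4.5 and areas 5 and 6 are dropped because the point with area 4 matches their delay at lower area. The sketch re-sorts for brevity; a single pass over the ordered list, as in the text, achieves the same result in linear time.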
Recall that in Fig. 7 only 4 points are stored at D instead of 6, since both $(c, d)$ and $(b, e)$ generate likely inferior points. This is because $(b, e)$, $(c, d)$, and $(c, e)$ all have the same average latency of 4.5 but point $(c, e)$ has the lowest area of 4.
It is important to note why dropping points with inferior average performance may lead to a sub-optimal solution. Consider the case where the critical path for Pattern 2 is not through this AND gate but through a different part of the graph. Then, the only pattern arrival time of importance is that of Pattern 1. For the point from $(b, e)$, the pattern arrival time for Pattern 1 is 4. Because this is the smallest pattern arrival time for Pattern 1 among the points with an area of 4 or higher, it could be part of the optimum solution and should not be dropped. Fortunately, our experimental results indicate that this heuristic effectively removes most surface points yet still maintains close to optimal solutions. This is because if the heuristic picks a point that yields a non-optimal solution, the difference from the optimal solution is probably small, since the point would most likely have significantly sub-optimal pattern arrival times for only very infrequent patterns.
We also note that for multi-fanout nodes, we adopt Chaudhary and Pedram's heuristic of dividing the area of each point by the number of its fanouts [7] . This means that the relative area cost of solutions in which the multi-fanout node is internal to a gate is very high. Consequently, this heuristic favors solutions in which the multi-fanout node is a gate output rather than internal to a gate.
In addition, to further reduce the number of points, we adopt the well-known binning technique [28] in which if the difference between two performance metrics is less than a user-defined bin size, the performance metrics are considered equal, yielding many more inferior points.
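The effect of binning on the inferiority test can be illustrated with a small Python sketch. The function names, the (area, performance) layout, and the bin size are hypothetical; performance is a delay, so smaller is better, and two values closer than the bin size are treated as equal.

```python
def no_worse(perf_a, perf_b, bin_size=0.0):
    """True if perf_a is better than perf_b or within one bin of it."""
    return perf_a <= perf_b or (perf_a - perf_b) < bin_size


def is_inferior(point, other, bin_size=0.0):
    """point is inferior if other has no larger area and, up to binning,
    no worse average performance."""
    area, perf = point
    other_area, other_perf = other
    return other_area <= area and no_worse(other_perf, perf, bin_size)


with_binning = is_inferior((5, 4.5), (4, 4.6), bin_size=0.2)
without_binning = is_inferior((5, 4.5), (4, 4.6), bin_size=0.0)
```

With a bin size of 0.2 the smaller point, although 0.1 slower, makes the larger point inferior; without binning both points would be kept.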
B. Decomposing the mapping of multi-output circuits
Ideally, once area-performance tradeoff surfaces at all DO's are computed, we would select a combination of DO points (one for each DO) which yield the best average performance over all possible point combinations that satisfy a given total area constraint. This, however, can be computationally very expensive since the number of DO point combinations that would need to be analyzed can be huge. Thus, we propose a fast heuristic to find a good, but potentially non-optimal, solution which is to iteratively cover each output cone individually and sequentially, using a decomposed area constraint. Another important advantage of this approach is that once an output cone is covered, all surface points on nodes in that cone can be freed, dramatically reducing the total memory needed for the algorithm.
It may be useful to explain why this heuristic may not yield a circuit with optimum performance. Consider the case where we are optimizing the average latency of a circuit that has two output cones, referred to as A and B, and two state transitions, $t_1$ with probability 0.9 and $t_2$ with probability 0.1. Assume for simplicity that these state transitions have only two phases and thus feedback delays need not be considered. In the frequent state transition $t_1$ assume both outputs transition, whereas in the less frequent state transition $t_2$ assume only the output of cone A transitions. Because we process cones individually and sequentially, when we optimize output cone A we favor the pattern arrival times corresponding to $t_1$. Unfortunately, it might be the case that the latency of $t_1$ is dictated by output cone B. Thus, in this case, the ideal thing to do would be to optimize A for the less frequent state transition $t_2$ (rather than $t_1$) because this would yield lower average latency. This problem occurs because cone A is only critical for relatively infrequent patterns. Fortunately, this means that the difference from the optimal solution is usually small since only the pattern arrival times of relatively infrequent patterns remain non-optimal. In addition, this problem can be mitigated by relaxing the required times for state transitions at a DO to be equal to the maximum corresponding pattern arrival times obtained from previously mapped cones. Also note that our estimates of the required feedback delay and settling times are automatically updated during the process of mapping each cone. Thus, cones mapped later in the algorithm have the benefit of more accurate estimates.
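The two-cone scenario can be made concrete with a short Python sketch. The arrival times below are hypothetical and chosen only to reproduce the situation described: cone B dictates the latency of $t_1$, so speeding up cone A for $t_1$ gains nothing, while speeding it up for $t_2$ lowers the average.

```python
def avg_latency_over_cones(prob, cone_arrival):
    """prob: transition -> probability.
    cone_arrival: cone name -> {transition: pattern arrival time}, listing
    only the transitions in which that cone's output switches."""
    total = 0.0
    for t, p in prob.items():
        # The latency of a transition is set by its slowest switching output.
        total += p * max(arr[t] for arr in cone_arrival.values() if t in arr)
    return total


prob = {"t1": 0.9, "t2": 0.1}
cone_b = {"t1": 5.0}                      # B switches only in t1
cone_a_for_t1 = {"t1": 2.0, "t2": 6.0}    # A mapped to favor t1
cone_a_for_t2 = {"t1": 4.0, "t2": 3.0}    # A mapped to favor t2

favor_t1 = avg_latency_over_cones(prob, {"A": cone_a_for_t1, "B": cone_b})
favor_t2 = avg_latency_over_cones(prob, {"A": cone_a_for_t2, "B": cone_b})
```

Favoring $t_1$ gives an average latency of 0.9 x 5 + 0.1 x 6 = 5.1, whereas favoring $t_2$ gives 0.9 x 5 + 0.1 x 3 = 4.8, even though $t_2$ is the rarer transition.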
C. Preorder traversal for top-down selection
Now assume we have selected a point in a DO's surface whose performance is the best of all points that satisfy the given local area constraint. Since the area and performance of this point are based on the unknown load approximation, the set of gates that lead to this point may actually yield different area and performance characteristics from those estimated by the point [7]. Thus, our algorithm performs a separate preorder traversal to actually select the remaining gates in the cover of this cone. In this way, we can keep the overall difference between estimated and actual area and performance metrics to typically less than 10%.
The first step of Algorithm 4 is to call find_best_point() to find the best point at node $n$. To facilitate this, the node data structure contains an array of pattern required times, a corresponding array of directions of importance, an array of settling required times, and a feedback required delay. The direction of importance for a given pattern indicates whether the pattern arrival time of the point chosen should have a larger or smaller value. The direction of all cld-patterns is always smaller because the chosen point should have a delay equal to or smaller than the required times. However, for fmc-patterns and fbd-patterns, both directions are needed. Recall that our algorithm computes the settling time and feedback delay by taking the difference of two pattern arrival times, one corresponding to an early transition and one corresponding to a late transition (these are computed from req_max_fbd.early[i] and req_max_fbd.late[i], respectively). To reflect the desire for the point to have a smaller difference, the chosen point's pattern arrival time for the early transition needs to be equal to or smaller than the required pattern arrival time, while the point's pattern arrival time for the late transition needs to be equal to or larger than the required pattern arrival time. Due to the unknown load problem, it may not be possible to find a point that satisfies all these constraints. For this reason, as illustrated in Algorithm 5, we compute a load shift error to represent by how much a point violates these constraints. Specifically, the load shift error is the weighted sum of the amounts by which the pattern arrival times violate the corresponding pattern required times. For a point to satisfy all pattern required times, the load shift error must equal zero. When no point satisfies all the pattern required times, the function selects the point with the smallest load shift error.
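A minimal Python sketch of a load shift error with directions of importance follows. It is not Algorithm 5: the weights are assumed to be given per pattern (the text does not specify them here), the pattern names are hypothetical, and the selection simply takes the point with the smallest error without any further ranking among points that satisfy every required time.

```python
def load_shift_error(pat, required, direction, weight):
    """Weighted sum of the amounts by which pattern arrival times violate
    their required times, honoring each pattern's direction of importance."""
    error = 0.0
    for i, req in required.items():
        if direction[i] == "smaller":
            violation = max(0.0, pat[i] - req)   # arrival must not exceed req
        else:
            violation = max(0.0, req - pat[i])   # arrival must not fall below req
        error += weight[i] * violation
    return error


def pick_point(points, required, direction, weight):
    """Return the candidate point with the smallest load shift error."""
    return min(points, key=lambda p: load_shift_error(p, required, direction, weight))


required = {"cld": 5.0, "early": 2.0, "late": 6.0}
direction = {"cld": "smaller", "early": "smaller", "late": "larger"}
weight = {"cld": 0.5, "early": 0.5, "late": 0.5}
points = [
    {"cld": 5.5, "early": 2.0, "late": 6.0},   # cld is 0.5 too late
    {"cld": 4.0, "early": 2.5, "late": 5.0},   # early too late, late too early
    {"cld": 5.0, "early": 1.5, "late": 6.5},   # satisfies every constraint
]
chosen = pick_point(points, required, direction, weight)
```

The three candidates have errors 0.25, 0.75, and 0, so the third point is chosen; if it were absent, the first point would be selected as the least-violating option.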
After finding the best point, Algorithm 4 sets the pattern required times for all fanins of the chosen gate. Note that for a DO $n$, whose point is selected by the user, $prt(n, i) = n.best\_p.pat[i]$ for all patterns $i$ for which $n$ is sensitized and $prt(n, i) = \infty$ for all other patterns. For all other nodes, the algorithm initializes the pattern required times as follows:
$$prt(m, i) = prt(n, i) - \text{p2p-delay}(n, m, best\_p.h.g, t), \qquad (9)$$
where $best\_p.h.g$ is the best matched gate rooted at $n$ stored in point $best\_p$.
Next, Algorithm 4 checks whether the chosen gate is an essential-hazard problem gate. If so, it computes the slack between $best\_p$'s stored feedback delay and the maximum feedback delay allowed. This slack, which can be either positive or negative, is used to compute a secondary pattern required time for the fanins. If this secondary pattern required time is more constraining than the initially computed pattern required time, the pattern required time is updated accordingly. The algorithm performs a similar process if the chosen gate is a fundamental-mode problem gate. Note that we initialize the maximum required feedback delay and settling times at the DO's to the values in the point $best\_p$ selected at the DO.
D. Complexity analysis
We focus our complexity analysis on our covering algorithm because it dominates the overall run time. We first consider a match $h$ at node $n$ where the matching gate $h.g$ has $l$ inputs and each input $k$ has $N(k)$ surface points. The number of points at $n$ is bounded by $\prod_{k=0}^{l} N(k)$. Assuming a fixed library size, the total number of matches at a node is a constant. We can thus conclude that the number of points from one level of the graph to the next grows at most polynomially with a degree that is bounded by the maximum number of gate inputs.
The time complexity to derive each point consists of the sum The high computational complexity of the algorithm is mitigated by a variety of factors. First, asynchronous control circuits tend to be relatively small (less than 1K gates). This is because to achieve good performance asynchronous systems are typically designed using a distributed control paradigm consisting of many relatively small controllers running concurrently rather than a single large controller. Second, as mentioned earlier, the number of patterns of interest for each output tends to be small. And, finally, most derived area-performance points have inferior average performance and are therefore dropped. Indeed, our preliminary results, described in Section VI, are much more promising than the above worst-case complexity analysis suggests.
VI. EXPERIMENTAL RESULTS
We conducted experiments on a suite of benchmark circuits, presented in Table II, using an AMD-K6 233 MHz Linux PC with 128 megabytes of memory. Each circuit is specified as an unoptimized Verilog netlist along with an XBM specification annotated with conditional probabilities. Using Markov chain analysis on the annotated XBM specification, we automatically generate the set of input patterns and their associated probabilities. The decomposition and covering algorithms are implemented on top of the POSE environment [12], which is the extension of the SIS framework [29] 
A. Decomposition results
We first used the SIS function tech_decomp to tree-decompose each benchmark into a hazard-free NAND-decomposed unoptimized network. We then applied our decomposition heuristic to rotate the network, optimizing first for latency and then for cycle time. Table III reports the average latency/cycle time for both the unoptimized and optimized circuits. The run-time for every circuit was less than 1 hour, with all but two circuits taking less than 7 minutes. The performance comparison between unoptimized and optimized circuits demonstrates that a rotated NAND-decomposed network typically shows a significant improvement (from 4% up to 39%) in average performance over its unoptimized NAND-decomposed counterpart.
B. Covering results
We conducted three mapping experiments on each benchmark: average-case mapping without rotation, average-case mapping with rotation, and worst-case mapping using a modified version of Chaudhary and Pedram's mapper that minimizes worst-case delay. For all experiments, we used the lib2 gate library included in the SIS package. For simplicity, we assumed that all gates in lib2 are hazard-free without running De Micheli and Siegel's algorithm to determine hazardous gates in lib2. The matching routine we used is borrowed from SIS and always associates the pin of a simple gate that has the smallest intrinsic delay to the input of a pattern graph with the longest path to the output. This choice of pin association favors covering longer paths of a network with shorter delay, leading to a more balanced covering that effectively minimizes worst-case delay. This unfortunately counteracts our efforts to create an unbalanced covering to shorten true critical paths. In order to truly reflect the potential advantage obtained from the rotation, we reversed the pin delay assignment of simple gates (e.g., NANDs, NORs) such that the pin with the shortest path to the output has the smallest intrinsic delay for our average-case mapping.
B.1 Evaluation of inferior point schemes
Table IV shows the number of surface points allocated and the run-time for each benchmark under both the exact and the heuristic point removal schemes. It also lists the average performance obtained with the heuristic and exact techniques when our mapper is run with rotation. The performance of the circuits obtained with the heuristic is almost identical to that obtained with the exact technique in all cases, demonstrating that the heuristic does not significantly reduce circuit quality. Moreover, the results demonstrate the memory and run-time efficiency of the heuristic. For example, when optimizing PE-SEND-IFC for average latency, the exact technique with rotation allocated 58,401 points, whereas the heuristic allocated only 1,469 points, and the run-time dropped dramatically (from 1640.8s to 17.1s). For the five largest benchmark circuits, we used a binning technique to keep the complexity manageable. Given equal bin sizes, the exact technique did not complete within 1 hour in twelve of the experiments, whereas the heuristic technique completed in all cases.
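To make the notion of an inferior point concrete, the following sketch shows exact removal for a simplified two-dimensional case in which each point is a (delay, area) pair. This is an assumption made for illustration: the points handled by our mapper may carry more dimensions, and neither the heuristic scheme nor the binning technique is reproduced here. The function name `remove_inferior` is hypothetical.

```python
def remove_inferior(points):
    """Return the non-inferior (delay, area) points, sorted by delay.

    A point is inferior if some other point has delay <= its delay and
    area <= its area. Two dimensions only; a simplification for illustration.
    """
    kept = []
    best_area = float("inf")
    # Sorting by delay, then area, means each candidate only has to beat
    # the smallest area seen so far among points that are at least as fast.
    for delay, area in sorted(set(points)):
        if area < best_area:
            kept.append((delay, area))
            best_area = area
    return kept


# Illustrative candidate solutions at one node (made-up numbers).
candidates = [(5, 10), (6, 8), (6, 9), (7, 8), (4, 12), (8, 3)]
frontier = remove_inferior(candidates)
```

Here `(6, 9)` and `(7, 8)` are dropped because `(6, 8)` is no slower and no larger than either of them; the remaining four points form the trade-off curve.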
B.2 Analysis of impact of rotation on mapped circuits
We next performed experiments to determine the impact of our rotation scheme on the mapped circuits. The results in Table V show that rotation leads to a better mapping for almost every circuit, most often yielding improvements of over 5%.
Interestingly, the results suggest that the improvement in the mapped circuits is sometimes limited by the structure of the circuit. For small, shallow networks, such as Q42, the improvement in average latency of the mapped circuits is significantly smaller than that observed in the unmapped circuits. This is because a large reduction in critical path delay can already be achieved by covering with unbalanced library graphs, without any rotation. In some of the larger and deeper networks (e.g., BINARY-COUNTER, TSEND and SCSI), however, rotation yields substantial improvements in the delay of the mapped circuits. This suggests that, in some cases, fewer gates are needed to cover the critical path of the rotated network than that of the un-rotated network.
C. Comparison to worst-case mapped circuits
We also compare our best average-case mapper (with rotation) to Chaudhary and Pedram's worst-case mapper and report the percentage improvement in average performance. It is important to note that this comparison is not intended as an apples-to-apples measurement between comparable technology mappers; rather, it is intended to estimate how much our mapper can benefit from optimizing for the average case.
The results show that our average-case mapper can yield significant average-case performance improvements over the traditional worst-case mapper. The improvements stem from two major factors. First, we find and optimize only the true critical path delay for each pattern. Second, we find the probability of each pattern and prioritize the mapping for patterns with higher probabilities. Since the circuits MERGE, BUFCTRL1, Q42 and BINARY-COUNTER have uniform pattern frequencies, the improvements obtained for them can be attributed to the first factor. The other circuits have a more skewed distribution of pattern probabilities, and thus the combination of both factors leads to significant improvement. Table VII shows that for 12 of the 15 circuits tested, our area is smaller than the area obtained using the worst-case mapper. This area reduction may be attributed to the fact that our algorithm uses large gates only on highly frequent critical paths.
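The role of pattern probabilities can be illustrated with a small numerical sketch of the frequency-weighted objective. The transition names, frequencies, and delays below are made up for illustration and are not taken from our benchmarks; `average_delay` is a hypothetical helper, not part of the mapper.

```python
def average_delay(freqs, delays):
    """Frequency-weighted average delay over all state transitions.

    freqs:  dict mapping transition -> relative frequency (need not sum to 1)
    delays: dict mapping transition -> true critical path delay for that transition
    """
    total = sum(freqs.values())
    return sum(freqs[t] * delays[t] for t in freqs) / total


# Hypothetical skewed distribution: transition t0 dominates.
freqs = {"t0": 0.7, "t1": 0.2, "t2": 0.1}

# A balanced mapping that minimizes the worst case (all transitions take 5.0).
worst_case_mapping = {"t0": 5.0, "t1": 5.0, "t2": 5.0}
# An unbalanced mapping that speeds up t0 at the expense of the rare t2.
average_case_mapping = {"t0": 3.0, "t1": 5.0, "t2": 6.0}

avg_wc = average_delay(freqs, worst_case_mapping)
avg_ac = average_delay(freqs, average_case_mapping)
```

In this toy example the unbalanced mapping has a larger worst-case delay (6.0 versus 5.0) but a smaller weighted average (about 3.7 versus 5.0), which is the trade-off the average-case mapper exploits when the distribution is skewed.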
D. Post-layout results
To further validate our results, we integrated our techniques with the Mentor Graphics physical design tools and evaluated the automatically generated layout using 0.8 µm Government CMOSN standard cells. Specifically, we extracted the wire capacitances from the layout and used them to analyze their impact on delay. The results demonstrate that the wire capacitance changed the average performance very little. Consequently, our post-layout improvements are similar in magnitude to our pre-layout values. The contribution of wire delay is small for two reasons: the Mentor Graphics router does a good job of keeping wires short for these relatively small circuits, and the standard cells in the CMOSN library are relatively large, so their intrinsic gate delay dominates the associated wire delay.
VII. CONCLUSIONS
We presented a technology mapping technique for optimizing the average performance of asynchronous burst-mode control circuits. We use stochastic techniques to determine the relative frequency of occurrence of each state transition and extend known analysis, decomposition, and covering techniques to optimize the mapped circuits for average latency or cycle time. Our results demonstrate that our techniques can simultaneously lead to significantly higher average performance and significantly smaller area when compared to traditional techniques. We also believe that these technology mapping algorithms are applicable to other fundamental-mode design styles, such as Nowick's UCLOCK method [23].
Extending our approach to fundamental-mode circuits implemented with generalized C-elements [37] is an interesting area of future research. Using generalized C-elements often increases the performance of the circuits, but their optimization would most likely involve handling hazard-free sequential decomposition as well as transistor-level delay analysis. Another possible direction for future work is to extend this type of average-case optimization to other control circuit design styles (e.g., quasi-delay insensitive, speed-independent, timed, etc.) or to other levels of the design hierarchy (e.g., architectural and logic synthesis, placement and routing, fanout optimization, etc.). Lastly, it may be possible to apply our average-case methodology to the technology mapping of synchronous telescopic units [2]. These blocks of combinational logic have variable latency (they take either one or two clock cycles), and mapping them for the average case can increase the probability that the unit takes only one clock cycle, thereby increasing system performance.
