Abstract-This paper describes a pseudo-polynomial time algorithm for timing analysis of a class of choice-free asynchronous systems, called tightly-coupled systems, with both min and max type timing constraints and bounded component delays. The algorithm consists of two phases: (i) long-term behavior analysis, that computes bounds on the time separation of events after the system has run for a sufficiently long period of time, and (ii) startup behavior analysis, that computes time separations between events during an initial startup period after the system is powered up. The results of the analysis are conservative in the worst-case; nevertheless, they are exact in our experiments. To demonstrate the practical utility of the approach, an asynchronous differential equation solver chip has been modeled and analyzed using the proposed algorithm. We report results of datapath timing verification, inter-controller protocol timing verification and performance analysis of the chip using the proposed technique.
I. INTRODUCTION
There is mounting evidence that asynchronous circuits are finding a niche in high-performance applications, such as Intel's asynchronous instruction length decoder [1] , [2] and the asynchronous differential equation solver benchmark circuit [3] . Common traits in these systems are a high degree of concurrency, distributed control and implicit timing assumptions to hide the control overhead. In order to guarantee correct operation of these systems as well as to estimate their performance, efficient timing analysis techniques are needed. The problem is compounded by the fact that statistical variations in manufacturing and operating conditions result in uncertainties in component delays in a chip. Consequently, timing analysis techniques for asynchronous systems with uncertain delays are needed.
A central problem in the analysis of asynchronous systems is computing bounds on the time separation between events [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14]. Existing techniques typically focus on controller behavior and inter-controller communication protocols, which involve synchronization or max type timing constraints in addition to linear constraints between event times. Systems with max-only or max-and-linear constraints have been well-studied and efficient algorithms exist for computing time separations in these systems [8], [9], [15], [16], [17], [11], [14]. However, modern high-performance asynchronous systems (e.g., the system in [3]) make implicit timing assumptions in both the datapath and control to facilitate design of high-speed circuits. Therefore, it is important to model and analyze both datapath and controller behaviors. Modeling datapath behavior, however, requires the use of min type timing constraints, especially when modeling components at a low level of abstraction. Similarly, performance metrics, such as the loop delay of an iterative process, are best modeled with both min and max type timing constraints. Therefore, there is a need for extending existing techniques for max-only systems to systems with min and max constraints, while remaining efficient.
Unfortunately, the problem of computing time separation bounds in systems with both min and max constraints is known to be computationally intractable [8] . Earlier researchers have proposed worst-case exponential-time techniques for analyzing such systems [18] , [8] , [13] , [19] . However, exponential-time techniques are not always well-suited in practice, e.g., in a design-analyze-redesign environment where multiple passes of analysis are often needed. More efficient timing analysis techniques that are approximate in the worst-case, but compute reasonably accurate results in practice may be more useful in such situations. Such techniques can also serve as filters to quickly narrow down the set of potential timing problems in a large design. The reduced set of potential problems can then be analyzed by more expensive exact techniques, leading to a pragmatic approach to timing analysis of large systems.
The above considerations motivate the current work. We describe an efficient algorithm for computing approximate bounds on the time separation of events in a class of asynchronous systems called tightly-coupled systems. Events in a tightly-coupled system may repeat over time, component delays can assume arbitrary values between specified bounds, and both min and max type timing constraints may be present. However, for efficiency of analysis, the system is assumed to be choice-free. Although this restriction is significant, we believe that a large and important class of applications can be modeled and analyzed as tightly-coupled systems. The results of such an analysis must, however, be interpreted carefully: they apply only during choice-free operation of the system. To demonstrate the practical utility of the approach, an asynchronous differential equation solver has been modeled and analyzed using the proposed algorithm.
The main contributions of this paper are summarized below.
Design of an efficient algorithm for computing approximate bounds on the time separation of events in a class of choice-free systems with both min and max type timing constraints. For systems with integer-bounded delays, an upper bound on the time complexity is derived.
Application of the above algorithm to verify timing constraints and estimate performance bounds of an asynchronous differential equation solver chip.
The remainder of this paper is organized as follows. Section II surveys related work on timing analysis of asynchronous systems. Section III describes cyclic timing constraint graphs as a formalism for representing timing constraints in systems with repeated events and gives a characterization of tightly-coupled systems. Section IV describes an efficient algorithm for computing approximate bounds on the time separation of events in tightly-coupled systems. Section V describes application of the algorithm to verify timing constraints and estimate performance bounds of an asynchronous differential equation solver chip. Section VI describes the limitations of the proposed technique. Section VII concludes the paper.
II. RELATED WORK
Burns [20] defined the cycle period of an asynchronous system as the asymptotic average time separation between consecutive occurrences of the same event, and proposed a polynomial-time algorithm for computing the period of max-only systems with fixed component delays. Subsequently, Lee [21] extended this work to systems with disjunctive constraints, of which min constraints are a special case. The worst-case complexity of Lee's technique is exponential in the size of the system. Gunawardena [22] gave a theoretical framework for analyzing systems of min and max constraints with fixed delays. He studied the periodic behavior of such systems and gave an analytic formula for the period. Several other researchers have also modeled asynchronous circuits as systems of repeated events with varying delay assumptions, and proposed different techniques for computing their performance metrics. A few representative works are those of Williams [23], Nielsen and Kishinevsky [24], Tofts [25], Kudva, Gopalakrishnan and Brunvand [26], Xie and Beerel [27] and Ebergen and Berks [28], [29].
McMillan and Dill [8] showed that finding exact bounds on the time separation of events is computationally intractable when both min and max timing constraints are present and component delays are uncertain but bounded. They proposed a polynomial-time algorithm for analyzing systems of non-repeated events with only max constraints, and gave a pseudo-polynomial time algorithm for systems with max and linear constraints. They also described a branch-and-bound algorithm, with worst-case exponential running time, for analyzing systems of min, max and linear constraints. Vanbekbergen et al. [9] proposed a cubic time algorithm for max-only systems with non-repeated events. Burks and Sakallah [19] described a branch-and-bound technique and a mixed integer linear program formulation for analyzing systems with min, max and linear timing constraints; however, both their techniques have worst-case exponential running time. In her dissertation [11], Walkup described an algorithm for analyzing systems of linear and max constraints, with applications to interface timing verification and interface logic synthesis. Yen et al. [14] have described another algorithm for interface timing verification in systems with max and linear constraints. The algorithms of both Walkup and Yen apply to systems with non-repeated events, and are conjectured to run in polynomial time. The exact complexity of the max and linear problem is, unfortunately, not yet known.
Myers and Meng described a polynomial-time algorithm for approximate timing analysis of max-only systems with repeated events and bounded delays [15] . In their method, a cyclic graph representation, called an event-rule system, is used to represent temporal dependencies between events. Their algorithm effectively unfolds this graph into an infinite acyclic graph and analyzes a finite subgraph to compute approximate bounds on the time separation between events. Amon, Hulgaard, Burns and Borriello [16] , [17] considered systems of repeated events with only max constraints, and proposed an algebraic technique for computing exact bounds on the time separation of events. Their technique implicitly unfolds a cyclic graph into an infinite acyclic graph and then uses algebraic techniques to determine the maximum time separation between a pair of events, maximized over all unfoldings. The worst-case complexity of their technique is exponential in the size of the system description. Hulgaard and Burns [30] , [13] also extended this work to systems with certain types of choices. Among other related works, Amon and Borriello [18] have described a symbolic timing verification technique based on constraint logic programming (CLP). Girodias, Cerny and Older [31] have described another timing verification technique based on CLP with relational interval arithmetic. Symbolic timing verification using Presburger formulas has been studied recently by Amon et al. [32] . These techniques, however, have worst-case complexity at least exponential in the size of the system description. Chakraborty et al. [33] have described a polynomial-time technique for obtaining conservative estimates of timing constraints required for correct operation of a class of asynchronous circuits. In [3] , a technique for uncovering timing bugs in an asynchronous differential equation solver chip has been reported.
The work that comes closest to the present work is that of Myers and Meng [15] . However, there are important extensions to and differences from their work, as summarized below.
The proposed algorithm can be used to analyze a class of systems with both min and max timing constraints. In contrast, Myers and Meng's algorithm is applicable to max-only systems.
For systems with finite component delays, it is shown that after the system has run for a sufficiently long period of time, finite bounds on the time separation of events can be obtained even without any knowledge of the behavior of the system immediately after power-up. Although this applies to Myers and Meng's algorithm as well, they do not prove that the bounds are finite.
The proposed algorithm consists of an iterative refinement phase, in which the computed time separation bounds monotonically approach their asymptotic values with every iteration. This is called monotonic convergence of the bounds. Myers and Meng do not study similar properties of the computed bounds.
A notion of convergence of analysis is defined, and an analytic upper bound on the number of iterations until convergence is derived for systems with integer-bounded delays. The work of Myers and Meng does not give such an analytic bound.
III. PROBLEM REPRESENTATION AND FORMALIZATION
This section describes cyclic timing constraint graphs as a formalism for representing temporal dependencies in choice-free systems with repeated events. Tightly-coupled systems are characterized, and the time separations problem is then formalized.
A. Cyclic timing constraint graphs
A cyclic timing constraint graph is a directed, labeled graph $G = (V, E)$. In general, the graph has two components: an acyclic component modeling the behavior of the system immediately after it is powered up or reset, and a cyclic component modeling the subsequent repeated behavior (see Fig. 1). A vertex in the acyclic component represents a single occurrence of an event, whereas a vertex in the cyclic component represents infinitely many occurrences of an event. The notation of Amon et al. [16], [17] In this paper, a vertex with a min operator is represented by a circle, and one with a max operator is represented by a square (see Fig. 1). For consistency of notation, a vertex is referred to as an event in the remainder of this paper. within the corresponding bounds, the time of occurrence of every event is well-defined, and the occurrence of one event cannot disable the occurrence of any other event. Hence, systems with conflict or choice cannot be modeled by cyclic timing constraint graphs.
Following the terminology of Amon et al. [16], [17] and Myers et al. [15], a timing constraint graph is said to be well-formed if the following conditions are satisfied: (a) every cycle has at least one marked edge, and (b) for every event $v$ in the cyclic component, there exists at least one cycle with exactly one marked edge that contains $v$. Condition (a) ensures that no event is deadlocked, waiting for itself to occur. Condition (b) ensures that the time of occurrence of $v_k$ is well-defined for every occurrence index $k$. All cyclic graphs considered in this paper are assumed to be finite and well-formed.
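The two well-formedness conditions can be checked mechanically on a finite graph. The following is a minimal sketch, not part of the original algorithm: the function name `is_well_formed` and the edge encoding as `(u, v, marked)` triples are assumptions made for illustration, and all vertices passed in are assumed to belong to the cyclic component. Condition (a) holds exactly when the unmarked edges alone form no cycle; condition (b) holds for $v$ when some marked edge $\langle x, y \rangle$ has $y$ reaching $v$ and $v$ reaching $x$ along unmarked edges.

```python
from collections import defaultdict

def is_well_formed(vertices, edges):
    """edges: iterable of (u, v, marked) triples, marked being a bool.
    Hypothetical encoding; every vertex is assumed to be in the cyclic component."""
    unmarked = defaultdict(set)
    for u, v, marked in edges:
        if not marked:
            unmarked[u].add(v)

    def reach(src):
        # Vertices reachable from src via unmarked edges only (src included).
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            for y in unmarked[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    reach_from = {v: reach(v) for v in vertices}

    # Condition (a): no cycle may consist of unmarked edges only.
    for u, v, marked in edges:
        if not marked and u in reach_from[v]:
            return False

    # Condition (b): every vertex lies on a cycle with exactly one marked edge.
    marked_edges = [(u, v) for u, v, marked in edges if marked]
    for v in vertices:
        if not any(v in reach_from[y] and x in reach_from[v]
                   for x, y in marked_edges):
            return False
    return True
```

For instance, a two-event loop `a -> b` (unmarked) and `b -> a` (marked) passes both checks, whereas making both edges unmarked violates condition (a).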
Let $G = (V, E)$ be a finite, well-formed cyclic timing constraint graph. The timing constraints represented by $G$ can be equivalently represented by an infinite acyclic graph G = (V ; E ) constructed as follows. [16], [15], [17], G is called the unfolded graph of $G$. Given a set, $X$, of events, the set of events reachable from some $v$ in $X$ is denoted $R(X)$. The set of events unreachable from all events in $X$ is denoted $R'(X)$. A cutset, $C$, is a finite set of events such that every path from Reset to every event reachable from $C$ passes through at least one event in $C$. For example, the set $\{a_1, b_1\}$ in Fig. 2 As an example, consider the unfolded graph in Fig. 2. The shaded sets of events, $C_1$, $C_2$ and $C_3$, represent the first three cutsets in an infinite sequence satisfying properties P1 through P4. Therefore, the cyclic timing constraint graph in Fig. 1 is tightly-coupled.
The reader may consider property P4 to be overly restrictive and wonder if removing min-type events, such as $d_i$ in Fig. 2, keeps the temporal behavior of the rest of the system unchanged. However, it can be shown that removing the $d_i$'s in Fig. 2 causes the maximum time separation from $b_i$ to $a_i$ to change, for all $i \geq 2$. Property P4 is, in fact, equivalent to the requirement of strongly connected graphs in the max-only systems considered by Amon et al. [16], Hulgaard et al. [17] and Myers and Meng [15]. Indeed, the systems considered in these earlier works are tightly-coupled systems with only max type events.
Let $G_i = (V_i, E_i)$ denote the subgraph "sandwiched" between cutsets $C_i$ and $C_{i+1}$, for all $i \geq 1$, with $E_i = \{\langle u, v \rangle : u, v \in V_i$ and $\langle u, v \rangle$ is an edge of the unfolded graph$\}$. For completeness, subgraph $G_0 = (V_0, E_0)$ is defined as the component modeling the behavior of the system from Reset to the events in $C_1$, i.e., $V_0 = R'(C_1) \cup C_1$ and $E_0 = \{\langle u, v \rangle : u, v \in V_0$ and $\langle u, v \rangle$ is an edge of the unfolded graph$\}$.
Due to the repetitive structure of the unfolded graph, it is easy to see that $G_i$ is isomorphic to $G_1$, for all $i \geq 1$. For a given choice of cutsets, the behavior modeled by $G_0$ is referred to as the initial behavior of the system, and the behavior modeled by $G_i$ ($i \geq 1$) is referred to as iteration $i$ of the system.
If the parameter in property P2 is chosen to be greater than 1 (see, for example, Fig. 7), subgraph $G_i$ may include multiple occurrences of the same event. This can lead to confusion about the labeling of events. To keep the notation unambiguous, the events in the unfolded graph are relabeled as follows:
All events in $G_0$ and $G_1$, except those in cutset $C_2$, are assigned unique labels with occurrence index 1.
For every event, $v_1$, in $G_1$, the corresponding event in $G_i$ (correspondence defined by the isomorphism from $G_1$ to $G_i$) is labeled $v_i$. This, of course, implies that $C_i = \{v_i : v_1 \in C_1\}$, so the parameter in property P2 reduces to 1.
C. The Problem
The time separation of events problem for tightly-coupled systems can now be formalized as follows:
Given subgraphs $G_i$ ($i \geq 0$) of the unfolded graph,
Determine upper bounds on the time separations of all ordered pairs of events in $G_0$.
Determine upper bounds on the time separations of all ordered pairs of events in $G_i$, maximized over all $i \geq 1$.
IV. ANALYSIS OF TIGHTLY-COUPLED SYSTEMS
This section describes a pseudo-polynomial time algorithm for computing the time separation bounds described above.
Since $G_i$ is isomorphic to $G_1$ for all $i \geq 1$, it suffices to analyze subgraphs $G_0$ and $G_1$. The algorithm consists of two phases, as summarized below.
The first phase uses a successive refinement approach to compute bounds on the time separation of events after the system has run for a sufficiently long period of time. Specifically, if $K$ denotes the number of successive refinement iterations, the computed bounds apply to every subgraph $G_i$ for $i \geq K$. This is explained in detail in Section IV-A below. Since the bounds computed in this phase apply only after the system has run for some time, they are called long-term time separation bounds and this phase of analysis is called long-term behavior analysis.
Assuming that the algorithm iterates $K$ times in the first phase, the second phase computes time separations between events in the finite acyclic graph modeling the system behavior from Reset to the events in cutset $C_K$. Bounds on the time separation of events in $G_0$ are obtained directly from the second phase. To obtain an upper bound on the time separation of a pair of events in $G_i$, maximized over all $i \geq 1$, the maximum of the corresponding bounds computed for subgraphs $G_1$ through $G_K$ is determined. The bounds for $G_1$ through $G_{K-1}$ are obtained from the second phase, and the bounds for $G_K$ (and for all $G_i$ with $i > K$) are obtained from the first phase. The following subsections describe each phase of the algorithm in greater detail.
A. Phase I of Analysis
We first concentrate on computing bounds on the time separation of events in subgraph $G_1$. The isomorphism of $G_1$ and $G_i$ for all $i \geq 1$ is then exploited to compute bounds on the time separation of events in subgraphs $G_i$ for $i \geq 1$.
Given an event $v$ in $G_1$, let $\Pi(v)$ denote the set of paths from events in cutset $C_1$ to $v$. The shortest and longest path lengths to $v$, denoted $l(v)$ and $L(v)$, are defined as follows: $l(v) = \min_{P \in \Pi(v)} \sum_{\langle x, y \rangle \text{ along } P} d_{x,y}$, and $L(v) = \max_{P \in \Pi(v)} \sum_{\langle x, y \rangle \text{ along } P} D_{x,y}$. Note that if $v$ is in $C_1$, there exists a degenerate path $(v)$ from $v$ to itself, so $l(v)$ is at most 0 and $L(v)$ is at least 0. Since edge delays in a timing constraint graph represent component delays in a system or its environment, edge delay bounds may be finite or infinite in general. An infinite delay bound may be required, for example, to model the response time of an environment whose delay characteristics are unknown. In this paper, however, all systems and environments are assumed to have finite, non-negative delay bounds. Since subgraph $G_1$ is finite, it follows that $l(v)$ and $L(v)$ are non-negative and finite for all events $v$ in $G_1$. A brief discussion of the implications of infinite edge delay bounds is given in Section VI.
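Because $G_1$ is acyclic, both path lengths can be computed in one pass over the events in topological order. The sketch below is illustrative only: the function name `path_length_bounds` and the data layout are assumptions, each edge is taken to carry a lower delay bound `d` and an upper delay bound `D`, and every event is assumed to be reachable from the cutset.

```python
def path_length_bounds(order, preds, cutset):
    """order:  events of G_1 in topological order.
    preds:  preds[v] is a list of (u, d, D), one per edge <u, v> with
            delay bounds [d, D] (hypothetical layout).
    cutset: the events of C_1.
    Returns dictionaries l and L of shortest and longest path lengths."""
    l, L = {}, {}
    for v in order:
        lows = [l[u] + d for u, d, D in preds.get(v, []) if u in l]
        highs = [L[u] + D for u, d, D in preds.get(v, []) if u in L]
        if v in cutset:
            # Degenerate path (v) from v to itself has length 0.
            lows.append(0)
            highs.append(0)
        # Assumes v is reachable from the cutset, so both lists are non-empty.
        l[v], L[v] = min(lows), max(highs)
    return l, L
```

With non-negative delays, a cutset event always ends up with $l(v) = 0$, while $L(v)$ can be positive when another cutset event precedes it.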
Let $n$ denote the number of events in $G_1$. Bounds on the time separation of events in $G_1$ are represented using an $n \times n$ matrix, $\Delta$. The entry in row $u$ and column $v$, denoted $\Delta(u, v)$, represents an upper bound on the time separation from event $u$ to event $v$. Since subgraph $G_1$ is acyclic, an algorithm for analyzing acyclic timing constraint graphs suffices to compute the entries of the matrix. For the current work, the AcyclicApproxSep algorithm of Chakraborty and Dill [34], outlined in Fig. 3, is used for this purpose (see footnote 1). Since $G_1$ lies between cutsets $C_1$ and $C_2$, the source events of $G_1$ are elements of cutset $C_1$. Therefore, in order to apply algorithm AcyclicApproxSep to $G_1$, the values of $\Delta(a_1, b_1)$ for every pair of events $a_1$ and $b_1$ in $C_1$ are required. Normally, these may be obtained by an analysis of subgraph $G_0$. However, in this work, we choose to decouple the analysis of the initial behavior from the long-term behavior analysis and assume that the initial behavior is unknown during Phase I of the analysis. Thus, the entries $\Delta(a_1, b_1)$ for every $a_1$ and $b_1$ in $C_1$ are conservatively set to $+\infty$ and algorithm AcyclicApproxSep is invoked. As will be seen later, this ensures that the long-term behavior analysis terminates within a pseudo-polynomial number of steps, while producing finite, conservative bounds on the long-term time separation of events. Note that since no assumptions are made about the initial behavior of the system, the time separation bounds computed in this phase apply regardless of the actual initial behavior. Therefore, an upper bound computed in this phase may be viewed as maximized over all possible initial behaviors.
In general, setting the upper bound of the time separation between every pair of source events to $+\infty$ can cause the computed bounds between other pairs of events to be $+\infty$ as well. However, this is not the case in tightly-coupled systems, as shown below.
1 Details of the algorithm may be found in [34] .
[Fig. 3: Outline of algorithm AcyclicApproxSep. $G$ is an acyclic timing constraint graph; $n$ = total number of events; $m$ = number of source events (events with no incoming edges).]
The iterative analysis may be terminated when one of the following conditions is satisfied: (a) the analysis has converged, or (b) the number of iterations has reached a pre-set limit, $K_{\max}$. If all edge delay bounds are integers, an analytic upper bound on the number of iterations required to converge can be derived (see Lemma 2). However, in practice, the analysis usually converges within a small number of iterations, at most 5 in our experiments. We conjecture that by choosing $K_{\max}$ to be in the range 10 to 20, the analysis will converge in most practical systems. Condition (b) ensures that the analysis terminates after a pre-set number of iterations if it has not already converged by then.
The pseudo-code in Phase I of Fig. 4 describes the iterative algorithm outlined above. If there are $n$ events in $G_1$ and $p$ denotes the maximum number of predecessors of an event, the complexity of algorithm AcyclicApproxSep, as noted in [34], is $O(n^2 p)$. The complexity of Phase I is therefore $O(K_{\max} n^2 p)$.
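The control structure of such an iterative refinement can be sketched as follows. This is a hedged illustration rather than the pseudo-code of Fig. 4: `refine` is a hypothetical callable standing in for one pass that analyzes $G_1$ and returns updated bounds for the cutset events, and convergence is assumed here to mean that the bounds no longer change between consecutive iterations.

```python
import math

def phase_one(refine, n_cut, k_max=20):
    """Iterate refine() until the bounds stop changing or k_max is reached.

    refine: hypothetical callable, cutset-bound matrix -> refined matrix.
    n_cut:  number of events in the cutset C_1.
    Returns (bounds, iterations_used, converged)."""
    # No knowledge of the initial behavior: every cutset pair starts at +inf.
    bounds = [[math.inf] * n_cut for _ in range(n_cut)]
    for k in range(1, k_max + 1):
        new_bounds = refine(bounds)
        if new_bounds == bounds:          # assumed convergence test
            return bounds, k, True
        bounds = new_bounds
    return bounds, k_max, False           # stopped by the iteration limit
```

Each call to `refine` corresponds to one invocation of the acyclic analysis, which is where the factor $K_{\max}$ in the Phase I complexity comes from.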
In order to characterize the bounds computed in Phase I, the $\leq_M$ relation for matrices is defined as follows. Let $\Delta$ and $\Delta'$ be two $n \times n$ matrices. $\Delta$ is said to be bounded above by $\Delta'$, denoted $\Delta \leq_M \Delta'$, if for all $u$ and $v$ in 0 through $n - 1$, the value of $\Delta(u, v)$ is no larger than the value of $\Delta'(u, v)$.
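The relation is an entry-wise comparison, which a short sketch makes concrete. The function name `bounded_above` and the list-of-lists matrix representation are assumptions made for illustration.

```python
import math

def bounded_above(delta, delta_prime):
    """Return True if delta(u, v) <= delta_prime(u, v) for all u, v.
    Both arguments are n x n matrices given as lists of rows."""
    if len(delta) != len(delta_prime):
        raise ValueError("matrices must have the same dimensions")
    return all(
        x <= y
        for row, row_prime in zip(delta, delta_prime)
        for x, y in zip(row, row_prime)
    )

# Any finite matrix is bounded above by the all-infinity starting matrix.
start = [[math.inf, math.inf], [math.inf, math.inf]]
refined = [[0, 8], [3, 0]]
print(bounded_above(refined, start))
```

Note that the relation is only a partial order: two matrices can each exceed the other in different entries, in which case neither is bounded above by the other.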
B. Phase II of Analysis
Suppose Phase I terminates after $K$ iterations. By Theorem 2, the bounds in $\Delta_K$ apply to events in every subgraph $G_i$ for $i \geq K$. To compute bounds on the time separation of events in $G_0$ through $G_{K-1}$, the initial segment of the unfolded graph, consisting of subgraphs $G_0$ through $G_{K-1}$, is constructed and analyzed using the AcyclicApproxSep algorithm. This is shown in Phase II of the pseudo-code in Fig. 4. If $m$ denotes the number of events in $G_0$, the complexity of Phase II is $O((m + K_{\max} n)^2 p)$. To compute bounds on the time separation of a pair of events in $G_i$, maximized over all $i \geq 1$, at most $K_{\max}$ additional comparisons are needed to determine the maximum of the corresponding bounds computed for $G_1$ through $G_K$. Repeating this for every pair of events in $G_i$ requires $O(K_{\max} n^2)$ comparisons. Therefore, the complexity of the entire procedure is $O((m + K_{\max} n)^2 p)$.
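The final maximization step is an entry-wise maximum over the per-iteration bound matrices. The sketch below assumes, purely for illustration, that the bounds for each iteration are already available as $n \times n$ lists of rows indexed by the relabeled events; the function name `max_over_iterations` is hypothetical.

```python
def max_over_iterations(matrices):
    """matrices: non-empty list of n x n bound matrices, one per iteration.
    Returns the entry-wise maximum, i.e. a bound valid for every iteration
    covered by the input. Uses about len(matrices) * n * n comparisons."""
    n = len(matrices[0])
    return [
        [max(m[u][v] for m in matrices) for v in range(n)]
        for u in range(n)
    ]
```

In this sketch the caller would pass the Phase II matrices for the early iterations followed by the single Phase I matrix that covers all later iterations.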
There are three sources of conservativeness in the above analysis.
1. Since we set the time separation of every ordered pair of events in $C_1$ to $+\infty$ at the beginning of Phase I, the computed long-term time separation bounds may be conservative. This can happen if, for example, the exact long-term bounds depend on the initial behavior of the system. The reader may wonder why we need to start the analysis assuming no knowledge of the initial behavior. However, as may be seen from the basis case of the proof of Lemma 1 (see Appendix), this guarantees monotonic convergence of the bounds in Phase I, which, in turn, guarantees that the analysis converges within a finite number of steps (see proof of Lemma 2).
2. Algorithm AcyclicApproxSep may return conservative bounds in the worst-case.
3. If we terminate Phase I by hitting $K_{\max}$, rather than by converging, the computed bounds may be conservative. In our experiments, however, all the computed bounds are found to be exact.
C. Examples
Let us now apply the above algorithm to some examples. This illustrates some of the properties described above.
Example 2 Consider a two-processor system processing tasks from a task queue. Each task is divided into two subtasks; thus, there can be up to two subtasks at the head of the queue at any time. All subtasks at the head must belong to the same task.
Thus, if one subtask of task T i is scheduled on a processor, but the other is not, there will be only one subtask at the head of the queue.
Subtasks are assigned to the processors by a task manager as follows. Initially, each subtask of the first task is assigned to a different processor. Processing a subtask produces an intermediate result; intermediate results from the subtasks are then combined to generate the final result, which consists of two components. Each component of the final result is computed by a different processor. As soon as one component is computed, it is sent to the task manager. The task manager then schedules the next subtask from the head of the queue (if there are two subtasks at the head, one is arbitrarily chosen) on the idle processor, without waiting for the other component of the current result. Thus, subtasks of multiple tasks may be simultaneously active in the two-processor system.
To prevent subtasks of different tasks from interfering with each other, some bookkeeping is done for each task. Assume that for each task, this is done once immediately after the first subtask enters the system, and then again after the first intermediate result is produced. Assuming infinite tasks in the task queue, we wish to model the behavior of this system and determine bounds on the latency of a task, defined as the time from the first subtask entering the system to the final component of the result being generated.
Suppose [5, 7] (bookkeeping + subtask [2, 3] 
Example 3
The next example, shown in Fig. 7a , is adapted from Amon et al's work [12] . This example demonstrates that i greater than or equal to two. 2 It has been shown by Amon et al. [12] that M i depends on both i and , as indicated in Table I . Fig. 7b shows subgraph G 1 of the unfolded graph. For clarity, the events in G 1 have not been relabeled to have occurrence index 1 -the relabeling was simply an artifact for maintaining consistent notation when proving properties of the algorithm.
The time separation of interest is that from a 3 to a 4 .
Applying algorithm CyclicApproxSep, we find that Phase I converges after 2 iterations when the delay parameter is 6, and $\Delta_2(a_3, a_4)$ equals 8. With the delay parameter equal to 9, Phase I converges after 5 iterations and $\Delta_5(a_3, a_4)$ equals 9. This demonstrates that the number of iterations depends on the edge delay bounds in general.
From the above results and Theorem 2, we conclude that if the delay parameter equals 6, $M_i$ is bounded above by 8 for all even $i \geq 4$. Similarly, if the delay parameter equals 9, then $M_i$ is bounded above by 9. Now, it is easy to see that if the cutsets are shifted by one occurrence index, so that $C_1$ is $\{b_3, d_3, f_2\}$ and $C_2$ is $\{b_5, d_5, f_4\}$, the same analysis applies and $\Delta_2(a_4, a_5)$ equals 8 when the delay parameter is 6, and $\Delta_5(a_4, a_5)$ equals 9 when the delay parameter is 9. Therefore, $M_i$ is bounded above by the same numbers for all odd $i \geq 5$. This matches the exact bounds computed by Amon et al.
2 $M_i$ in this notation corresponds to the quantity with index $i - 1$ in the notation of Amon et al. [12].
V. APPLICATION: ANALYZING AN ASYNCHRONOUS CHIP
We now apply the CyclicApproxSep algorithm to verify timing constraints and estimate performance bounds of an asynchronous differential equation solver chip. A known timing bug in a preliminary version of the design, which was detected only after several hours of SPICE simulation, was uncovered in a few seconds by the analysis. Fig. 8, adapted from Yun et al. [3], shows the architecture of the chip and the dataflow graph for one iteration of operation. The system iterates through the operations depicted in the dataflow graph until a termination condition is satisfied. In this paper, however, we are interested only in the system behavior as it iterates through the dataflow graph, so the termination condition is ignored. This, of course, implies that the computed results do not apply to the final iteration when the system chooses to terminate the computation. The behavior in the final iteration must be analyzed separately, and is not addressed in this paper. Details of the DiffEq design are described in Yun et al.'s paper [3].
A. Chip Overview
The control logic of DiffEq is implemented using four distributed extended burst-mode (XBM) controllers [35]. These are labeled ALU1 CTRL, ALU2 CTRL, MUL1 CTRL and MUL2 CTRL in Fig. 8b. If we ignore the checks for the termination condition in their state transition diagrams, each controller exhibits choice-free behavior [3]. The controllers communicate with each other by means of a handshaking protocol that, unlike speed-independent or delay-insensitive designs, assumes safe timing bounds in signaling. The datapath is structurally identical to those found in synchronous designs, except for the completion-reporting logic associated with the ALU1, ALU2, MUL1 and MUL2 units. As in synchronous systems, the control is responsible for generating ALU opcodes, MUX selects, register load enables, and precharge/evaluate signals. Unlike conventional asynchronous circuits, there are implicit assumptions on the timing of datapath signals, which must be satisfied for the system to function correctly. For example, data inputs to domino circuits must stabilize before evaluation begins.
B. Modeling controller timing
We view each controller as a "black box" that receives input stimuli from its environment and responds by asserting values on its output ports after some internal delay. This model suffices for performance analysis, and datapath and inter-controller protocol timing verification. Other timing constraints required for correct operation of individual XBM controllers can be obtained by the technique described by Chakraborty et al. [33] and are assumed to be satisfied. As an example, Fig. 9a shows the XBM state transition diagram of the MUL2 controller in DiffEq. A signal name with a + indicates a rising transition, a name with a − indicates a falling transition, while one with a * denotes a directed don't care. Details of the semantics of these transition types are described in Yun's thesis [35]. Fig. 9b shows how each state transition is modeled using timing constraint graphs. Note that A1M is a directed don't care during the state transition $S_1 \to S_2$, so by XBM semantics, the controller need not wait for a transition on A1M to change state from $S_1$ to $S_2$. This is reflected in Fig. 9b by the absence of an edge from a transition on A1M to $m_2$. However, A1M has a terminating falling transition during the next state transition $S_2 \to S_0$, so the controller must wait for A1M to fall before changing states from $S_2$ to $S_0$. This is represented by drawing an edge from A1M− to $m_3$ in Fig. 9b. in the graph fragment modeling $S_2 \to S_0$. This is an inter-controller synchronization event, generated by ALU2 CTRL and communicated to MUL2 CTRL. To model this communication, the corresponding events in the timing constraint graphs for ALU2 CTRL and MUL2 CTRL are connected by an edge labeled with the signal propagation delay from ALU2 CTRL to MUL2 CTRL. Continuing this process, the timing constraint graph fragments of the different controllers can be stitched together to obtain one larger graph modeling the behavior of the four interacting controllers during one iteration of DiffEq.
The above discussion concentrated on modeling choice-free XBM controllers. In general, timing constraint graphs can also be used to model other choice-free controllers if the temporal dependencies between all events can be expressed using min and max type constraints. Each input and output signal transition is represented by distinct events and if the output transition occurs only after all the inputs have transitioned, a max type constraint is used to connect them. This is the case, for example, in event-rule systems [20] , signal graphs [24] , choice-free XBM controllers, etc. If the output transitions as soon as one of the inputs has transitioned (e.g., in some extended event-rule systems [21] ), a min type constraint is used instead.
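The difference between max type and min type events can be illustrated with a minimal sketch. It evaluates event times in a small acyclic timing constraint graph for one fixed choice of edge delays; it is not the bounded-delay analysis of this paper. The graph encoding, the function name event_times, the event names and the delay values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: event -> (kind, [(predecessor, edge delay), ...]).
# Events are listed in topological order; source events have no predecessors.
Graph = Dict[str, Tuple[str, List[Tuple[str, float]]]]


def event_times(graph: Graph) -> Dict[str, float]:
    """Occurrence time of every event for one fixed assignment of edge delays."""
    times: Dict[str, float] = {}
    for event, (kind, preds) in graph.items():
        if not preds:
            times[event] = 0.0  # assumption: every source event occurs at time 0
            continue
        arrivals = [times[p] + delay for p, delay in preds]
        # A max type event waits for all predecessors; a min type event
        # occurs as soon as the earliest predecessor has taken effect.
        times[event] = max(arrivals) if kind == "max" else min(arrivals)
    return times


# Illustrative graph (not taken from the paper's figures).
graph: Graph = {
    "in1+": ("max", []),
    "in2+": ("max", []),
    "wait_all+": ("max", [("in1+", 2.0), ("in2+", 5.0)]),
    "first_of+": ("min", [("in1+", 2.0), ("in2+", 5.0)]),
}
times = event_times(graph)
print(times["wait_all+"], times["first_of+"])
```

With identical predecessors and delays, the max type event occurs at the later arrival and the min type event at the earlier one.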
C. Modeling datapath timing
This section describes modeling techniques for two important datapath components. The behavior of other datapath components can be modeled similarly.
C.1 Arithmetic circuits designed in domino logic
The signal opcode at the output of the NOR gate in Fig. 10a falls as soon as one of its inputs rises. Therefore, the falling of opcode is modeled by a min type event (circle in Fig. 10b). This demonstrates the need for min type constraints when modeling datapath components at a low level of abstraction. Events Op1Stable and Op2Stable represent the stabilization of data operands, whereas InputsStable represents the stabilization of all inputs (both data operands and opcode). Evaluation begins as soon as A2Prech is de-asserted (A2Prech-). Assuming that InputsStable occurs before A2Prech- (this is checked during timing verification), the circuit computes the sum/difference of the operands and raises A2Done after some data-dependent computation delay. This is modeled as shown in Fig. 10b. The event InputsChanged in Fig. 10b represents a change in the value of one of the inputs after evaluation.
C.2 Edge-triggered registers
Fig. 11a shows register A, a positive edge-triggered register, clocked by the signal LA, in DiffEq (see Fig. 8b). For simplicity, let us assume that there are no asynchronous set/reset inputs. If the data inputs remain stable within the setup and hold-time window around the rising edge of LA (this is checked during timing verification), data is loaded when LA rises, and appears at the output port after a "clock-to-Q" delay. This is modeled as shown in Fig. 11b. The event DataOutValid represents the appearance of latched data at the output port of the register. DataInStable and DataInChanged represent the stabilization of the data inputs prior to latching, and their subsequent change. The event LA- represents the falling of LA after the latching LA+ transition.
If a level-sensitive latch with the high clock phase as the active period were used in place of the edge-triggered register, DataOutValid would be a max type event with two incoming edges: one from DataInStable labeled with the "data-to-Q" delay, and the other from LA+ labeled with the "clock-to-Q" delay. This, of course, assumes that the setup and hold times are satisfied, which can be checked by computing the time separations from DataInStable and DataInChanged to the latching edge LA-.
D. Formulating performance metrics and timing constraints
In this section, performance metrics and timing constraints for correct operation of DiffEq are formulated as time separations between events. The objective is to apply the CyclicApproxSep algorithm to verify all timing constraints and obtain bounds on performance metrics of the system. Note that since the CyclicApproxSep algorithm is conservative in the worst case, false timing violations may be reported and the estimated performance bounds may be pessimistic in the worst case.
D.1 Performance metrics
We concentrate on two performance metrics:
Loop delay: The delay from the start of the first operation in an iteration of the dataflow graph to the end of the last operation in the same iteration.
Cycle time: The delay between similar events in successive iterations of the dataflow graph.
Since the dataflow graph has multiple threads of computation which do not necessarily start or end simultaneously (see Fig. 8a), the loop delay is not the same as the cycle time. The inverse of the cycle time gives the rate at which successive values of intermediate variables in the computation are generated by the system. However, if the values of all variables at the end of n iterations are needed, then the total time required is ((n - 1) × cycle time + loop delay).
To compute the loop delay, the timing constraint graph for a generic iteration (as opposed to a specific iteration, such as the first iteration) of the dataflow graph is constructed. Since there are multiple parallel threads of computation, this timing constraint graph has multiple source events. The starting time of an iteration is represented by taking the minimum of the times of occurrence of all source events. A min type event, StartOfIter, is, therefore, added to the timing constraint graph. Similarly, the end of computation in an iteration is represented by taking the maximum of the ending times of all threads. In this case, a max type event, EndOfIter, is added. Bounds on the time separation between StartOfIter and EndOfIter give the best and worst-case loop delays of the dataflow graph. Note that both min and max type constraints are needed to model the loop delay.
To estimate the cycle time, the time separation between the start of two consecutive iterations of the dataflow graph is computed. Two generic iterations of the dataflow graph are considered. The start of one iteration has already been modeled above. The start of the next iteration is modeled similarly by taking the minimum of the start times of the threads in the next iteration. Another min type event, called StartOfNextIter, is therefore added to the timing constraint graph. Bounds on the time separation between StartOfIter and StartOfNextIter then give estimates of the best-case and worst-case cycle time of the system. Note that "cycle time" is sometimes used to denote the average time for n iterations. If such a metric is desired, one must construct the timing constraint graph for n iterations and determine the time separation between similar events spaced n iterations apart. This separation divided by n then gives the average cycle time.
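The definitions of StartOfIter, EndOfIter and StartOfNextIter can be made concrete with a minimal sketch that evaluates the two metrics for one fixed set of thread start and end times. Obtaining bounds under delay variation requires the CyclicApproxSep analysis, which this sketch does not attempt. The function names and all numeric values are assumptions chosen for illustration, and total_time assumes the same cycle time in every iteration.

```python
from typing import Sequence


def loop_delay(starts: Sequence[float], ends: Sequence[float]) -> float:
    """EndOfIter (max over thread end times) minus StartOfIter (min over thread start times)."""
    return max(ends) - min(starts)


def cycle_time(starts: Sequence[float], next_starts: Sequence[float]) -> float:
    """StartOfNextIter minus StartOfIter, both modeled as min type events."""
    return min(next_starts) - min(starts)


def total_time(n: int, cycle: float, loop: float) -> float:
    """Time until all variables of iteration n are available: (n - 1) * cycle time + loop delay."""
    return (n - 1) * cycle + loop


# Illustrative numbers for two parallel threads (not measurements from the chip).
starts = [0.0, 3.0]        # thread start times in the current iteration
ends = [12.0, 15.0]        # thread end times in the current iteration
next_starts = [10.0, 14.0] # thread start times in the next iteration

loop = loop_delay(starts, ends)
cycle = cycle_time(starts, next_starts)
print(loop, cycle, total_time(4, cycle, loop))
```

In this example the next iteration starts before the current one has finished, so the cycle time is smaller than the loop delay, which is why the two metrics are treated separately.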
The reader may argue that bounds on the loop delay or cycle time are not representative performance metrics; instead, average loop delay and cycle time are. However, we believe that finding bounds on these metrics is useful because it allows one to investigate how variations in component delays affect the best-case and worst-case system performance. This is important if one is interested in performance guarantees, such as an upper bound on the delay for n iterations or the minimum rate at which results are computed.
D.2 Timing constraints
Timing constraints in the DiffEq datapath consist of setup, hold and minimum clock pulse-width constraints of registers, and setup-time constraints of domino circuits. In order to check register timing constraints, we examine every instance of a register loading data during a generic iteration of the dataflow graph. Setup-time violations are checked by computing the minimum time separation from the stabilization of the data inputs to the latching edge of the clock signal (DataInStable to LA+ in Fig. 11b). If this separation exceeds the setup-time of the register, no setup-time violation can occur. Similarly, register hold-time constraints are checked by computing the minimum time separation from the latching edge of the clock to the subsequent change of the input data (LA+ to DataInChanged in Fig. 11b).
Finally, the minimum clock pulse-width is obtained by determining the minimum time separation from the latching edge of the clock to the subsequent transition on the same signal (LA+ to LA- in Fig. 11b).
To verify the timing of a domino circuit, we consider every instance of the circuit performing some computation during a generic iteration of the dataflow graph. We determine the minimum time separation between the stabilization of all inputs and the start of the evaluation phase (InputsStable to A2Prech- in Fig. 10b). A negative value of this separation indicates that the data inputs can potentially change after evaluation begins. This can lead to accidental discharge of the internal nodes, resulting in incorrect outputs. Note that if the input data lines are guaranteed to rise monotonically, the internal nodes cannot be accidentally discharged even if the inputs stabilize during the evaluation phase. However, the monotonic rising constraint is often too restrictive; hence ensuring that the data inputs stabilize before evaluation begins is a robust way of ensuring the correct operation of the circuit.
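Once lower bounds on the relevant time separations are available, the register and domino checks described above reduce to simple comparisons. The following is a minimal sketch of that final step only. The RegisterSpec type, the function names, the dictionary encoding of separations and all numeric values are hypothetical, and the use of non-strict comparisons is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical encoding: (from_event, to_event) -> computed lower bound on the separation.
Separations = Dict[Tuple[str, str], float]


@dataclass
class RegisterSpec:
    setup: float
    hold: float
    min_pulse_width: float


def check_register(spec: RegisterSpec, min_sep: Separations) -> Dict[str, bool]:
    """True means the constraint holds for every delay assignment within the bounds."""
    return {
        "setup": min_sep[("DataInStable", "LA+")] >= spec.setup,
        "hold": min_sep[("LA+", "DataInChanged")] >= spec.hold,
        "pulse_width": min_sep[("LA+", "LA-")] >= spec.min_pulse_width,
    }


def check_domino(min_sep: Separations) -> bool:
    """Inputs must be stable no later than the start of evaluation (A2Prech-)."""
    return min_sep[("InputsStable", "A2Prech-")] >= 0


# Illustrative lower bounds (not results from the paper).
min_sep: Separations = {
    ("DataInStable", "LA+"): 4.0,
    ("LA+", "DataInChanged"): 0.5,
    ("LA+", "LA-"): 6.0,
    ("InputsStable", "A2Prech-"): -1.0,
}
register_ok = check_register(RegisterSpec(setup=2.0, hold=1.0, min_pulse_width=3.0), min_sep)
domino_ok = check_domino(min_sep)
print(register_ok, domino_ok)
```

Because the computed bounds are conservative, a failed check signals a potential violation rather than a certain one.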
Finally, we consider verifying timing constraints in the inter-controller communication protocol. In DiffEq, these constraints arise from XBM requirements. For example, the state transition diagram of MUL2 CTRL (Fig. 9a) gives rise to the constraints that the minimum time separation from each of M2A2+ and M2prech+ to A2M- must be at least 0. Other timing constraints in the inter-controller communication protocol of DiffEq are of a similar nature, i.e., they are needed to ensure that an event always occurs some time after another event. Such constraints can be formulated in terms of time separations between events, as explained above. Note, however, that with non-XBM controllers, inter-controller protocol constraints may be complex, and it might not be possible to represent them simply as time separations between events.
E. Experimental Results
We have manually constructed a cyclic timing constraint
Since algorithm AcyclicApproxSep, used in both phases of the analysis, produces conservative results in the worst case, bounds computed in Phases I and II may be inexact. Therefore, for comparison, the analysis with 5% delay variations is repeated with McMillan and Dill's branch-and-bound algorithm [8] (an exact algorithm for acyclic timing constraint graphs) used in place of the AcyclicApproxSep algorithm. All the bounds are found to be identical to those obtained with AcyclicApproxSep. However, the complete analysis (Phases I and II) using McMillan and Dill's algorithm takes more than 11 hours on the same computing platform.
We present two subsets of our timing verification results, corresponding to 5% and 10% variations of delays, in Table II. Each entry lists a pair of events, the required minimum separation between these events, and the computed bounds on the separation during Phases I and II (lower and upper bounds shown are the minimum of the lower bounds and the maximum of the upper bounds obtained in the two phases).
Experiments have also been performed to determine how the correctness of the design depends on the delay of the A1M signal (an inter-controller synchronization signal) in Fig. 8b. This was motivated by feedback from the designers, who had observed by means of extensive SPICE simulations that in a preliminary version of the design, where the delay of the buffer driving A1M was small, the circuit could potentially malfunction. To investigate this, experiments were performed in which the delay of the buffer driving the A1M signal was deliberately reduced. The computed bounds immediately indicated potential setup-time violations of domino circuits in the system. The entries marked with a * in Table II show some of these results. Note that the current analysis uncovered the potential errors within a few seconds, whereas it took several hours of SPICE simulations to detect the same problems. This demonstrates the practical utility of the approach.
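The way the Phase I and Phase II bounds are merged into a single table entry, and then compared against the required minimum separation, can be sketched as follows. The function names and the numeric values are assumptions made for illustration and are not entries from Table II.

```python
from typing import Tuple

Bounds = Tuple[float, float]  # (lower bound, upper bound) on a time separation


def combine_phases(phase1: Bounds, phase2: Bounds) -> Bounds:
    """Minimum of the lower bounds and maximum of the upper bounds of the two phases."""
    return (min(phase1[0], phase2[0]), max(phase1[1], phase2[1]))


def meets_requirement(required_min: float, bounds: Bounds) -> bool:
    """The constraint is verified only if the combined lower bound reaches the requirement."""
    return bounds[0] >= required_min


# Illustrative bounds for one pair of events (hypothetical values).
combined = combine_phases((3.0, 9.0), (2.0, 7.0))
print(combined, meets_requirement(2.5, combined), meets_requirement(1.0, combined))
```

Taking the weaker bound from each phase keeps the reported entry valid for both the startup period and the long-term behavior.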
VI. DRAWBACKS AND LIMITATIONS
Despite its advantages, there are certain drawbacks and limitations of the proposed method. These are summarized below.
Currently, timing constraint graphs are manually constructed from other descriptions (e.g., RTL) of the system behavior. For large and complex systems such as DiffEq, this is a tedious and error-prone task. Additional research on automating (at least partially) the generation of timing constraint graphs from standard system descriptions needs to be done in order to facilitate the widespread use of the proposed method. Note, however, that constructing a timing constraint graph from a standard system description is expected to be considerably easier, in general, than verifying complex timing constraints between events.
Choice is an important feature of many practical asynchronous systems. Since tightly-coupled systems cannot model choice, this is a severe limitation of the current work.
Even for systems without choice, property P4 in the definition of tightly-coupled systems (Definition 1) restricts the types of systems that can be analyzed to those with few min and mostly max constraints. Yet another restriction is that the delays of all components, both in the system and its environment, must be finitely bounded. The second restriction makes it difficult to model interface circuits, where, for example, a system might interact with an environment with unknown delay characteristics. However, both property P4 and the finite edge delays are needed to ensure that finite bounds on the long-term time separation of events can be obtained in pseudo-polynomial time even without any knowledge of the initial system behavior. This may be seen from the proofs of Theorem 1 and Corollary 1, which, in turn, are needed to prove Lemma 2 and Theorem 2. Systems satisfying properties P1 through P3, but not P4, or systems with infinite edge delay bounds may still be analyzed using the proposed algorithm. However, there is no guarantee that the computed bounds will be finite. Since infinite bounds are not very useful in general, the usefulness of analyzing such systems with the proposed algorithm is unclear. Note, however, that the computed bounds are conservative regardless of whether property P4 is satisfied or whether the edges have finite delays.
Although the iterative refinement in Phase I of algorithm CyclicApproxSep can be terminated after a pre-set number of iterations, it is usually desirable to let the analysis converge since that gives tighter bounds by the monotonic convergence property. A drawback of the current method is that for pathological cases, the number of iterations required to converge may depend on edge delay bounds, and hence, can be large.
Each event in a timing constraint graph is associated with either a min or a max type constraint, but not both. However, in certain systems, the time of occurrence of an event depends on the min of the times of some of its predecessors and on the max of the times of some other predecessors. In addition, linear constraints may exist between the time of the current event and those of other events. Unfortunately, systems with multiple types of constraints associated with the same event cannot be modeled or analyzed with the proposed technique.
VII. CONCLUSION
This paper presented a pseudo-polynomial time algorithm for computing conservative bounds on the time separation of events in a class of choice-free asynchronous systems with min and max type timing constraints. The systems considered are more general than those considered earlier by Amon et al. [16], Hulgaard et al. [17] and Myers et al. [15]. Although the algorithm is based on an idea similar to Myers and Meng's algorithm [15], there are several important extensions to and differences from their work. We believe that these represent important contributions to the existing body of work on timing analysis of asynchronous systems with min and max timing constraints. A complete asynchronous differential equation solver chip has been modeled and analyzed using the proposed technique, demonstrating the practicality of the approach. Although the analysis is conservative in the worst case, our experiments (DiffEq and other examples adapted from the literature [12], [13], [10], not all of which could be described in the paper due to lack of space) yielded exact results. This suggests that carefully designed efficient algorithms for approximate timing analysis of asynchronous systems can be a promising alternative to more expensive exact techniques. Such techniques can be extremely useful in design-analyze-redesign environments, or may even be used as fast filters to narrow down the set of potential timing problems in large and complex designs. We believe that such techniques will play an important role in helping asynchronous circuits gain wider acceptance in niche application domains.
APPENDIX

Proof of Theorem 1:
Proof: Let there be k events in each of C1 and C2. Without loss of generality, the events in C1 are assigned topological indices 0 through k - 1, and those in C2 are assigned topological indices n - k through n - 1. The theorem is proved by showing that (i, j) is bounded above by L(j) - l(i) for all i and j in n - k through n - 1.
Let i be an integer in the above range. The proof consists of three parts. First, we prove the bound for (i, j) when j lies in 0 through k - 1. Next, we consider (i, j) with j in the range k through i - 1. Finally, we consider (i, j) when j lies in i + 1 through n - 1. Note that in the first two cases, i is greater than j, so (i, j) is computed in the ith iteration of the outermost for loop of algorithm AcyclicApproxSep (see Fig. 3). However, in the third case, j is greater than i, so (i, j) is computed in the jth iteration of the outermost loop.
Part I: By property P4 in Definition 1, there exists a path P from event j to event i such that all events along P have max type constraints. Let v be the predecessor of i along P. It follows from step (i) of function ComputeInitialEstimates (see (1) However, v itself lies on path P, so it is a max type vertex, and index j is less than index v. Let u be the predecessor of v along P. Inequality (5) follows from the hypothesis, and inequality (6) follows from the observation that the longest path length to j is no smaller than the longest path length to any predecessor of j plus the maximum delay of the corresponding edge to j.
Part III: Finally, consider (i, j), with j lying in the range i + 1 through n - 1. The proof is by complete induction on j.
Basis: Follows from parts (I) and (II).
Hypothesis: Let j be an integer in the range i + 1 through n - 1. For all r in 0 through j - 1, let (i, r) be bounded above by L(r) - l(i). Using an argument similar to that used in the induction step of part (II), the value of (i, j) is seen to be bounded above by L(j) - l(i).
Proof of Lemma 1:
Proof: We start with the following observation about algorithm AcyclicApproxSep.
Proof of Lemma 2:
Proof: The constant term, 2, accounts for the first iteration and the final iteration in which convergence is detected. The lemma is proved by showing that there can be at most (r - 1) From Theorem 1, we know that for every pair of distinct events u and v in C2, the value of (u, v) is bounded above by L(v) - l(u) after the first iteration. Lemma 1 implies that in every subsequent iteration, the value of (u, v) either remains unchanged or decreases. The minimum value to which it can decrease is -(v, u), which is bounded below by l(v) - L(u). Therefore, the range of variation of (u, v) is (L(v) - l(v)) + (L(u) - l(u)). The same range applies to the variation of (v, u) as well.
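The range of variation stated above is the difference between the upper limit given by Theorem 1 and the lower limit, as the following short derivation spells out. The numeric values in the final line are hypothetical and serve only as an arithmetic check.

```latex
\begin{align*}
\underbrace{\bigl(L(v) - l(u)\bigr)}_{\text{upper limit (Theorem 1)}}
  - \underbrace{\bigl(l(v) - L(u)\bigr)}_{\text{lower limit}}
  &= L(v) - l(u) - l(v) + L(u) \\
  &= \bigl(L(v) - l(v)\bigr) + \bigl(L(u) - l(u)\bigr). \\[4pt]
\text{Example: } L(u) = 7,\ l(u) = 4,\ L(v) = 10,\ l(v) = 8:\quad
  (10 - 4) - (8 - 7) &= 5 = (10 - 8) + (7 - 4).
\end{align*}
```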
Since all edge delay bounds are integral, (u, v) and (v, u) can decrease only by integral amounts. Therefore, after the first iteration, there can be at most (L(v) - l(v)) + (L(u) - l(u)) iterations in which the values of (u, v) and (v, u) decrease. Summing this over all pairs of distinct events in C2 gives (r -
Proof of Theorem 2:
Proof: If at least two iterations are made, Corollary 1 and step 2(b)(iii) of Phase I of algorithm CyclicApproxSep (see Fig. 4) imply that the bounds between every pair of source events in C1 are finite in the final iteration. Since all edge delay bounds are also finite, it is easy to see from the steps in algorithm
